A bit of theory

The barcode of a hierarchical clustering

Barcodes are a visualization tool for data analysis that play an important role in persistent homology. This tutorial explains how barcodes provide a simple visualization of a hierarchical clustering.

_images/hierarchical-clustering-barcode.png

A (1-parameter) hierarchical clustering of a dataset X is a parameterized family of clusterings H(X) of X. Above (in black) is a schematic picture of a hierarchical clustering.

For real values r in some interval of the real line, we have a clustering H(X)_r of the data, and as the parameter r grows, the clustering can change in the following ways:

  1. A new cluster can be added to the hierarchy (the triangles in the diagram).

  2. Two clusters can merge (the dots).

  3. The hierarchical clustering can end (the square).

  4. New data points can be added to a cluster (not shown in the diagram).

An example of such a hierarchical clustering is the output of the single-linkage algorithm. Persistable constructs hierarchical clusterings using graphs that encode both spatial relations among data points and density (see Introduction to Persistable).
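
As a concrete illustration, single-linkage clustering computed with SciPy already gives such a 1-parameter hierarchical clustering, and fixing a parameter r recovers the clustering H(X)_r at that value. The following minimal sketch uses a made-up dataset and is only meant to make the definition tangible:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A small, made-up dataset
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 0.0]])

# Single-linkage produces a 1-parameter hierarchical clustering of X
Z = linkage(X, method="single")

# The clustering H(X)_r at r = 0.5: all merges happening at distance <= 0.5 are
# applied, giving three clusters (the first two points together, the others apart)
print(fcluster(Z, t=0.5, criterion="distance"))
```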

For a rigorous definition of hierarchical clustering, see Section 2 of Stable and consistent density-based clustering.

The barcode of a hierarchical clustering H(X) consists of bars, which are simply intervals of the real line. In the picture above, the barcode is displayed in green. It can be constructed as follows:

  1. When a new cluster enters H(X) at parameter r, start a new bar with left endpoint r.

  2. If two clusters merge at parameter r, take the cluster that entered the hierarchy at the larger parameter value and end its bar by setting its right endpoint to r.

  3. If the hierarchical clustering ends at parameter r, set the right endpoint of all bars that do not yet have a right endpoint to r.

Rule 2 is called the Elder rule: the elder bar (the one that entered the hierarchy first) survives, while the younger bar dies.

For a rigorous description of this algorithm, see Section 5 of Stable and consistent density-based clustering (in particular, Algorithm 1).
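
To make the construction concrete, here is a minimal Python sketch of the three rules above, assuming the hierarchy is encoded as leaf birth parameters together with a list of merges sorted by parameter; this encoding is chosen for illustration and is not Persistable's internal representation.

```python
def barcode(births, merges, end):
    """births: dict mapping each leaf to the parameter at which it enters (rule 1).
    merges: list of (r, a, b, merged_id), sorted by r (rule 2).
    end: parameter at which the hierarchical clustering ends (rule 3).
    Returns the bars as a list of (left endpoint, right endpoint)."""
    bars = []
    alive = dict(births)  # birth parameter of each cluster that is still alive
    for r, a, b, merged_id in merges:
        # Elder rule: the cluster that entered at the larger parameter dies at r
        elder, younger = (a, b) if alive[a] <= alive[b] else (b, a)
        bars.append((alive[younger], r))
        alive[merged_id] = alive.pop(elder)  # the merged cluster keeps the elder's birth
        del alive[younger]
    # bars still open when the hierarchy ends get right endpoint `end`
    bars.extend((birth, end) for birth in alive.values())
    return sorted(bars)


# Two leaves entering at r = 0 and r = 1 that merge at r = 2.5,
# with the hierarchy ending at r = 4
print(barcode({"A": 0.0, "B": 1.0}, [(2.5, "A", "B", "AB")], end=4.0))
# [(0.0, 4.0), (1.0, 2.5)]
```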

The barcode of a hierarchical clustering forgets some information. In particular, when two clusters merge, the barcode does not record which two clusters merged. This forgetfulness keeps the barcode quite simple, making it easy to read even for fairly complicated hierarchical clusterings. See Clustering with Persistable for how we use the barcode of a hierarchical clustering to choose parameters for Persistable.

The persistence-based flattening algorithm

Given a hierarchical clustering of data, the persistence-based flattening algorithm extracts a single clustering of the data. We call this a “flattening” algorithm because it reduces a 1-parameter hierarchical clustering to a flat (0-parameter) clustering.

This tutorial provides a short description of the persistence-based flattening algorithm. For details, see Section 6 of Stable and consistent density-based clustering. This method is inspired by the ToMATo algorithm; see Related tools for density-based clustering.

The term “persistence” comes from the theory of persistent homology; in particular, we will use the barcode of a hierarchical clustering to describe this algorithm:

_images/hierarchical-clustering-barcode.png

A schematic picture of a hierarchical clustering (in black), and its barcode (in green).

See The barcode of a hierarchical clustering for an introduction to this notion.

A simple flattening algorithm is to just take the leaves of the hierarchical clustering (the blue dashed edges):

_images/hierarchical-clustering-leaves.png

Recall that bars in the barcode are created whenever a new cluster enters the hierarchy, so the bars are in one-to-one correspondence with the leaves of the hierarchical clustering.

To be more explicit, let’s add some data points to this picture:

_images/hierarchical-clustering-all-points.png

The picture represents a hierarchical clustering of the set {a,b,c,d,e}. Point a enters the hierarchy at the parameter r = 0, and point b enters at r = 1. At r = 1.5, point c enters the hierarchy, and is immediately clustered with a. The clusters {a,c} and {b} merge at r = 2.5, and so on.

The clustering of the data we get by taking the leaves has three clusters: {a,c}, {b}, and {d}. The point e is not assigned to any cluster, since it never lived inside a leaf.
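
The leaf clustering is straightforward to compute from an event-based description of the hierarchy. The sketch below uses an illustrative encoding; the event format, the cluster ids, and the parameter values for d and e are assumptions made for the example, since the text above only specifies the values up to r = 2.5.

```python
def leaf_clustering(events):
    """Each event is (r, kind, data), sorted by parameter r, with kind one of:
    'new'   -> data = (leaf id, points): a new leaf enters the hierarchy,
    'add'   -> data = (cluster id, points): points join an existing cluster,
    'merge' -> data = (a, b, merged id): two clusters merge.
    Returns a dict mapping each leaf to the points assigned to it."""
    clusters = {}  # leaf id -> points collected while the cluster is still a leaf
    is_leaf = {}
    for r, kind, data in events:
        if kind == "new":
            leaf, points = data
            clusters[leaf] = set(points)
            is_leaf[leaf] = True
        elif kind == "add":
            cluster, points = data
            if is_leaf.get(cluster, False):  # points joining a non-leaf stay unassigned
                clusters[cluster] |= set(points)
        elif kind == "merge":
            a, b, merged = data
            is_leaf[a] = is_leaf[b] = is_leaf[merged] = False
    return clusters


events = [
    (0.0, "new", ("L1", ["a"])),
    (1.0, "new", ("L2", ["b"])),
    (1.5, "add", ("L1", ["c"])),
    (2.5, "merge", ("L1", "L2", "M1")),
    (3.0, "new", ("L3", ["d"])),  # illustrative parameter value
    (3.5, "add", ("M1", ["e"])),  # e joins a cluster that is no longer a leaf
]
print(leaf_clustering(events))  # {'L1': {'a', 'c'}, 'L2': {'b'}, 'L3': {'d'}}
```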

A nice property of the leaf clustering is that it can identify cluster structure occurring at different scales. In this example, once the point d has entered the hierarchy, points a, b, and c are all clustered together. So, we cannot get the leaf clustering just by fixing a single value of r and taking the clustering H(X)_r at that value.

However, the leaf clustering can give poor results if the hierarchy has many spurious leaves, as in this example:

_images/hierarchical-clustering-barcode-noise.png

The persistence-based flattening attempts to avoid this problem by first pruning the hierarchy, and then taking the leaf clustering. For this we use the barcode of the hierarchical clustering as a visual guide.

In the example above, we might guess that the three very short bars correspond to spurious leaves. So, we can choose a value t that is greater than the length of the three short bars, and prune the hierarchy by shortening each leaf by t:

_images/hierarchical-clustering-barcode-noise-pruned.png
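
In terms of the barcode, this choice is easy to read off: any t larger than the lengths of the bars we consider spurious, and smaller than the lengths of the bars we want to keep, prunes exactly the spurious leaves. A short sketch with made-up bar endpoints:

```python
# Made-up bar endpoints standing in for the barcode in the picture above
bars = [(0.0, 4.0), (0.5, 3.5), (1.0, 1.2), (1.8, 2.0), (2.6, 2.7)]

print(sorted(round(d - b, 2) for b, d in bars))  # [0.1, 0.2, 0.2, 3.0, 4.0]

# Any t above the three short lengths and below the long ones works, e.g. t = 0.5
t = 0.5
print(sum(d - b > t for b, d in bars))  # 2 leaves survive the pruning
```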

To see what happens to the data points inside the leaves, consider data points a and b living inside different leaves in the original hierarchical clustering:

_images/hierarchical-clustering-points.png

After pruning, these points are pushed rightwards along the hierarchy, and live inside the same leaf:

_images/hierarchical-clustering-points-pruned.png

Notice that, when we prune before taking the leaf clustering, we’re not just removing low-persistence clusters from the end result. Since these low-persistence leaves can “interrupt” high-persistence leaves, pruning allows larger clusters to form.

As an example, consider the dataset from Introduction to Persistable. There, we considered a hierarchical clustering obtained by taking a slice of the DBSCAN graphs:

_images/component_counting_slice.png

If we choose to keep 7 bars, the persistence-based flattening produces the following result:

_images/hdbscan_data_7_clusters.png

If we choose to keep only 6 bars, the two clusters at the bottom are able to merge:

_images/hdbscan_data_clustered.png
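
The number of bars to keep and the pruning parameter are two views of the same choice: keeping the k longest bars corresponds to choosing a pruning threshold between the k-th and (k+1)-th longest bar lengths. The helper below is an illustrative sketch of this correspondence (its name, the bar values, and the assumption that there are no ties in bar length are not part of Persistable's API):

```python
def threshold_for_k_bars(bars, k):
    """Return a pruning parameter t that keeps exactly the k longest bars
    (assuming the k-th and (k+1)-th longest bars have different lengths)."""
    lengths = sorted((d - b for b, d in bars), reverse=True)
    # any t with lengths[k] <= t < lengths[k - 1] keeps exactly the k longest bars
    return (lengths[k] + lengths[k - 1]) / 2


# Made-up barcode with three long bars and two shorter ones
bars = [(0.0, 4.0), (0.2, 3.0), (0.5, 2.0), (1.0, 1.3), (1.1, 1.2)]
t = threshold_for_k_bars(bars, 3)
print(sum(d - b > t for b, d in bars))  # 3 bars (and hence 3 leaves) survive
```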