Clustering techniques

Ahsan Ijaz

Clustering Applications

  • Group related information together to uncover structure (e.g., organizing documents by topic)

Clustering

  • Unsupervised learning task
  • No labels provided
  • Uncovering structure by using only input
  • Input: documents represented as vectors \(x_i\)
  • Output: a cluster label \(z_i\) for each document

Characteristics of a cluster

  • Center and shape
  • Assign observation \(x_i\) (doc) to cluster k (topic label) if
    • Its score under cluster k is higher than under any other cluster
    • Distance from the cluster center can serve as this score
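
A minimal sketch of this assignment rule, assuming documents are already encoded as numeric vectors (the array and function names below are illustrative):

  import numpy as np

  def assign_to_clusters(X, centroids):
      """Assign each row of X (one doc vector) to its nearest centroid.

      X:         (n_docs, n_features) array of document vectors
      centroids: (k, n_features) array of cluster centers
      Returns an array of cluster labels, one per document.
      """
      # Pairwise Euclidean distances between documents and centroids.
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      # Lower distance = higher "score" under that cluster.
      return dists.argmin(axis=1)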

Ambiguous clusters

K-means clustering

  • Partitional clustering approach
  • Each cluster is associated with a centroid
  • The number of clusters, K, is specified in advance.
  1. Select K points as initial centroids.
  2. Repeat:
  3. Form K clusters by assigning each point to the closest centroid.
  4. Recompute the centroid of each cluster.
  5. Until the centroids no longer change.
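
A minimal NumPy sketch of this loop, assuming X holds one data point per row (the function name and convergence test are illustrative):

  import numpy as np

  def kmeans(X, k, n_iters=100, seed=0):
      rng = np.random.default_rng(seed)
      # 1. Select K points as initial centroids (chosen at random here).
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(n_iters):
          # 3. Form K clusters by assigning each point to the closest centroid.
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # 4. Recompute the centroid of each cluster.
          new_centroids = centroids.copy()
          for j in range(k):
              members = X[labels == j]
              if len(members) > 0:   # keep the old centroid if a cluster empties
                  new_centroids[j] = members.mean(axis=0)
          # 5. Stop once the centroids no longer change.
          if np.allclose(new_centroids, centroids):
              break
          centroids = new_centroids
      return labels, centroids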

Illustrating k-means

K-means clustering details

  • Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
  • ‘Closeness’ is measured by a proximity measure, e.g., Euclidean distance, cosine similarity, or correlation.
  • K-means will converge for the common proximity measures mentioned above.
  • Most of the convergence happens in the first few iterations.
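
For illustration, the two most common proximity measures for document vectors can be computed as follows (a small sketch with made-up vectors):

  import numpy as np

  a = np.array([1.0, 0.0, 2.0])   # toy document vectors
  b = np.array([2.0, 0.0, 4.0])

  euclidean = np.linalg.norm(a - b)                         # distance: smaller = closer
  cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity: larger = closer
  print(euclidean, cosine)   # b is far in Euclidean terms but points in the same direction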

Importance of choosing initial clusters

K-means clustering can depend heavily on the initial cluster centroids.
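
A quick way to see this dependence, assuming scikit-learn is available (the data here is synthetic and purely illustrative):

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  X = rng.normal(size=(300, 2))          # synthetic 2-D points

  # Run k-means several times with a single random initialization each time;
  # the final inertia (sum of squared distances to centroids) varies by run.
  for seed in range(3):
      km = KMeans(n_clusters=10, init="random", n_init=1, random_state=seed).fit(X)
      print(seed, km.inertia_)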

Example of 10 clusters (initial centroids)

Example of 10 clusters

Limitations of K-means (differing sizes)

Limitations of K-means (differing density)

Limitations of K-means (nonglobular shapes)

Other examples of limitations

Density based clustering

Locate regions of high density separated from one another by regions of low density.

Notion of density

  • Euclidean density
    • Number of points per unit volume
  • Probability density

Cell-based Euclidean density

  • Divide the region into a number of rectangular cells of equal volume and define density as the number of points in each cell.
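
As a sketch, a 2-D cell-based density can be computed with a histogram over equal-size cells (the grid size and data are arbitrary):

  import numpy as np

  X = np.random.default_rng(0).normal(size=(1000, 2))   # illustrative 2-D points
  # 10x10 grid of equal-area cells; each entry counts the points falling in that cell.
  density, x_edges, y_edges = np.histogram2d(X[:, 0], X[:, 1], bins=10)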

Center-based Euclidean density

  • Number of points within a specified radius of the point
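
A center-based counterpart counts the neighbours within a fixed radius of each point (the function name and radius are illustrative):

  import numpy as np

  def center_density(X, radius):
      # For each point, count how many points lie within `radius` of it
      # (the count includes the point itself).
      dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      return (dists <= radius).sum(axis=1)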

DBSCAN

  • Density = number of points within a specified radius (Eps)
  • A point is a core point if it has more than a specified number of neighbors (MinPts) within Eps.
    • These are points that are at the interior of a cluster.
  • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  • A noise point is any point that is neither a core point nor a border point.
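
A minimal sketch of these definitions (the function name, Eps, and MinPts values are illustrative, and each point's Eps-neighbourhood includes the point itself):

  import numpy as np

  def label_points(X, eps, min_pts):
      dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      neighbors = dists <= eps
      # Core: at least MinPts neighbours within Eps.
      core = neighbors.sum(axis=1) >= min_pts
      # Border: not core, but within Eps of some core point.
      border = ~core & (neighbors & core[None, :]).any(axis=1)
      # Noise: neither core nor border.
      noise = ~core & ~border
      return core, border, noise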

DBSCAN: Core, border and noise points

DBSCAN Algorithm

  • Eliminate noise points
  • Perform clustering on the remaining points
  1. Label all points as core, border or noise points.
  2. Eliminate noise points.
  3. Put an edge between all core points that are within Eps of each other.
  4. Make each group of connected points into one cluster.
  5. Assign each border point to one of the clusters of its associated core points.
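
These steps are essentially what sklearn.cluster.DBSCAN implements; a minimal usage sketch, with illustrative parameter values and synthetic data:

  import numpy as np
  from sklearn.cluster import DBSCAN

  X = np.random.default_rng(0).normal(size=(200, 2))    # illustrative 2-D points
  db = DBSCAN(eps=0.3, min_samples=5).fit(X)
  labels = db.labels_   # cluster index per point; noise points are labelled -1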

When DBSCAN works well

  • Resistant to noise
  • Can handle different shapes and sizes

When DBSCAN fails

  • Varying densities
  • High dimensional data