Clustering techniques

Ahsan Ijaz

Clustering Applications

  • Group related information together to uncover structure (e.g., organizing documents by topic)

Clustering

  • Unsupervised learning task
  • No labels provided
  • Uncovering structure by using only input
  • Input: documents represented as vectors \(x_i\)
  • Output: a cluster label \(z_i\) for each document

Characteristics of a cluster

  • Center and shape
  • Assign observation \(x_i\) (doc) to cluster k (topic label) if
    • Its score under cluster k is higher than under any other cluster
    • Distance from the cluster center can serve as this score
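
A minimal sketch of this assignment rule, assuming documents are already encoded as numeric vectors (the array and function names below are illustrative):

  import numpy as np

  def assign_to_clusters(X, centroids):
      """Assign each row of X (one doc vector) to its nearest centroid.

      X:         (n_docs, n_features) array of document vectors
      centroids: (k, n_features) array of cluster centers
      Returns an array of cluster labels, one per document.
      """
      # Pairwise Euclidean distances between documents and centroids.
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      # Lower distance = higher "score" under that cluster.
      return dists.argmin(axis=1)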

Ambiguous clusters

K-means clustering

  • Partitional clustering approach
  • Each cluster is associated with a centroid
  • The number of clusters, K, is specified in advance.
  1. Select K points as initial centroids.
  2. Repeat:
  3. Form K clusters by assigning each point to the closest centroid.
  4. Recompute the centroid of each cluster.
  5. Until the centroids no longer change.
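
A minimal NumPy sketch of this loop, assuming X holds one data point per row (the function name and convergence test are illustrative):

  import numpy as np

  def kmeans(X, k, n_iters=100, seed=0):
      rng = np.random.default_rng(seed)
      # 1. Select K points as initial centroids (chosen at random here).
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(n_iters):
          # 3. Form K clusters by assigning each point to the closest centroid.
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # 4. Recompute the centroid of each cluster.
          new_centroids = centroids.copy()
          for j in range(k):
              members = X[labels == j]
              if len(members) > 0:   # keep the old centroid if a cluster empties
                  new_centroids[j] = members.mean(axis=0)
          # 5. Stop once the centroids no longer change.
          if np.allclose(new_centroids, centroids):
              break
          centroids = new_centroids
      return labels, centroids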

Illustrating k-means

K-means clustering details

  • Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
  • ‘Closeness’ is measured by a proximity measure, e.g., Euclidean distance, cosine similarity, or correlation.
  • K-means will converge for the common proximity measures mentioned above.
  • Most of the convergence happens in the first few iterations.
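
For illustration, the two most common proximity measures for document vectors can be computed as follows (a small sketch with made-up vectors):

  import numpy as np

  a = np.array([1.0, 0.0, 2.0])   # toy document vectors
  b = np.array([2.0, 0.0, 4.0])

  euclidean = np.linalg.norm(a - b)                         # distance: smaller = closer
  cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity: larger = closer
  print(euclidean, cosine)   # b is far in Euclidean terms but points in the same direction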

Importance of choosing initial clusters

K-means clustering can depend heavily on the initial cluster centroids.
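
A quick way to see this dependence, assuming scikit-learn is available (the data here is synthetic and purely illustrative):

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  X = rng.normal(size=(300, 2))          # synthetic 2-D points

  # Run k-means several times with a single random initialization each time;
  # the final inertia (sum of squared distances to centroids) varies by run.
  for seed in range(3):
      km = KMeans(n_clusters=10, init="random", n_init=1, random_state=seed).fit(X)
      print(seed, km.inertia_)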

Example of 10 clusters (initial centroids)

Example of 10 clusters

Limitations of K-means (differing sizes)

Limitations of K-means (differing density)

Limitations of K-means (nonglobular shapes)

Other examples of limitations

Density based clustering

Locate regions of high density separated from one another by regions of low density.

Notion of density

  • Euclidean density
    • Number of points per unit volume
  • Probability density

Cell-based Euclidean density

  • Divide the region into a number of rectangular cells of equal volume and define density as the number of points in each cell.
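
As a sketch, a 2-D cell-based density can be computed with a histogram over equal-size cells (the grid size and data are arbitrary):

  import numpy as np

  X = np.random.default_rng(0).normal(size=(1000, 2))   # illustrative 2-D points
  # 10x10 grid of equal-area cells; each entry counts the points falling in that cell.
  density, x_edges, y_edges = np.histogram2d(X[:, 0], X[:, 1], bins=10)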

Center-based Euclidean density

  • Number of points within a specified radius of the point
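
A center-based counterpart counts the neighbours within a fixed radius of each point (the function name and radius are illustrative):

  import numpy as np

  def center_density(X, radius):
      # For each point, count how many points lie within `radius` of it
      # (the count includes the point itself).
      dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      return (dists <= radius).sum(axis=1)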

DBSCAN

  • Density = number of points within a specified radius (Eps)
  • A point is a core point if it has more than a specified number of neighbors (MinPts) within Eps.
    • These are points that are at the interior of a cluster.
  • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  • A noise point is any point that is neither a core point nor a border point.
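
A minimal sketch of these definitions (the function name, Eps, and MinPts values are illustrative, and each point's Eps-neighbourhood includes the point itself):

  import numpy as np

  def label_points(X, eps, min_pts):
      dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      neighbors = dists <= eps
      # Core: at least MinPts neighbours within Eps.
      core = neighbors.sum(axis=1) >= min_pts
      # Border: not core, but within Eps of some core point.
      border = ~core & (neighbors & core[None, :]).any(axis=1)
      # Noise: neither core nor border.
      noise = ~core & ~border
      return core, border, noise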

DBSCAN: Core, border and noise points

DBSCAN Algorithm

  • Eliminate noise points
  • Perform clustering on the remaining points
  1. Label all points as core, border or noise points.
  2. Eliminate noise points.
  3. Put an edge between all core points that are within Eps of each other.
  4. Make each group of connected points into one cluster.
  5. Assign each border point to one of the clusters of its associated core points.
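
These steps are essentially what sklearn.cluster.DBSCAN implements; a minimal usage sketch, with illustrative parameter values and synthetic data:

  import numpy as np
  from sklearn.cluster import DBSCAN

  X = np.random.default_rng(0).normal(size=(200, 2))    # illustrative 2-D points
  db = DBSCAN(eps=0.3, min_samples=5).fit(X)
  labels = db.labels_   # cluster index per point; noise points are labelled -1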

When DBSCAN works well

  • Resistant to noise
  • Can handle different shapes and sizes

When DBSCAN fails

  • Varying densities
  • High dimensional data