A Refresher on Clustering Analysis – Part II

Author

Zhenglei Gao

Published

March 12, 2025

Clustering Algorithms: Pros and Cons Comparison

K-means
Pros:
• Fast and efficient (runtime roughly linear in the number of observations)
• Easy to implement
• Scales well to large datasets
Cons:
• Requires pre-specifying k
• Sensitive to outliers
• Limited to spherical clusters
• Euclidean distance only
• Results depend on initial centroid placement
Best use cases: Datasets with well-separated, globular clusters of similar size

PAM
Pros:
• More robust to outliers than k-means
• Works with any distance metric
• Better for categorical/mixed data
Cons:
• Computationally expensive (O(k(n-k)²) per iteration)
• Requires pre-specifying k
• Slower than k-means
Best use cases: Datasets with outliers where medoids are preferred over means

Hierarchical - Single Linkage
Pros:
• Detects elongated clusters
• No need to specify k beforehand
• Provides a cluster hierarchy
Cons:
• Chaining effect
• Poor with noise
• O(n²) space complexity
Best use cases: Detecting non-elliptical clusters, finding outliers

Hierarchical - Complete Linkage
Pros:
• Creates compact clusters
• Less susceptible to noise than single linkage
Cons:
• Tends to break large clusters
• Sensitive to outliers
• O(n²) space complexity
Best use cases: When compact, evenly sized clusters are desired

Hierarchical - Average Linkage
Pros:
• A compromise between single and complete linkage
• Less sensitive to outliers than single or complete linkage
Cons:
• Computationally intensive
• Still O(n²) complexity
Best use cases: General-purpose hierarchical clustering with moderate outliers

Hierarchical - Ward’s
Pros:
• Minimizes within-cluster variance
• Creates similar-sized clusters
Cons:
• Sensitive to outliers
• Tends toward spherical clusters
Best use cases: When minimum-variance clusters are important

DBSCAN
Pros:
• No need to specify the number of clusters
• Finds arbitrarily shaped clusters
• Identifies outliers as noise
• Handles varying cluster sizes
Cons:
• Sensitive to parameters (eps, minPts)
• Struggles with varying densities
• Difficulty with high dimensions
Best use cases: Spatial data, datasets with noise and non-globular clusters
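
As a rough illustration of the shape-related entries above, the following sketch (my own illustration, not part of the original post) fits k-means, single-linkage hierarchical clustering, and DBSCAN to the same two-moons toy dataset. It assumes scikit-learn is installed; the dataset, seed, and parameter values are arbitrary choices for demonstration only.

```python
# Minimal sketch: the same non-globular dataset under three of the algorithms above.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

results = {
    "k-means (spherical bias)": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "single linkage (elongated shapes)": AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X),
    "DBSCAN (arbitrary shapes, noise = -1)": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
}

for name, labels in results.items():
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore DBSCAN's noise label
    print(f"{name}: {n_clusters} clusters, {int(np.sum(labels == -1))} noise points")
```

On data like this, k-means typically cuts each crescent roughly in half, while single linkage and DBSCAN tend to recover the two moons.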

Critical Differences:

  1. Specifying clusters: K-means and PAM require pre-defined k; hierarchical methods and DBSCAN do not
  2. Outlier handling: DBSCAN explicitly identifies outliers; PAM is moderately robust; k-means is highly sensitive (see the sketch after this list)
  3. Cluster shapes: K-means/PAM find spherical clusters; hierarchical varies by linkage; DBSCAN finds arbitrary shapes
  4. Scalability: K-means scales best to large datasets; hierarchical methods and PAM are more computationally intensive
  5. Distance metrics: K-means limited to Euclidean; others can use various distance measures
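
To make difference #2 concrete, here is a small sketch (again my own illustration, not from the post) with two Gaussian blobs plus one extreme outlier; the data, seed, and parameters are assumptions chosen only to show the contrast.

```python
# k-means must assign every point, so an extreme outlier can claim or drag a centroid;
# DBSCAN instead labels it as noise (-1).
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),  # blob 1
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),  # blob 2
    [[50.0, 50.0]],                                        # one extreme outlier
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=1.0, min_samples=5).fit(X)

print("k-means cluster sizes:", dict(Counter(km.labels_.tolist())))  # outlier distorts the partition
print("DBSCAN label counts  :", dict(Counter(db.labels_.tolist())))  # -1 = points flagged as noise
```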

Clustering Algorithm Similarities: Relationship Map

Primary Algorithm Relationships

  1. K-means ↔︎ PAM: Most closely related pair
    • Both are partitioning algorithms requiring pre-specified k
    • Both iteratively optimize cluster representatives
    • PAM is essentially a robust variant of k-means using medoids instead of centroids
    • Both create flat, non-hierarchical clusters of similar sizes
  2. Hierarchical Variants: Form their own family
    • All hierarchical methods (single, complete, average, Ward’s) share the same dendrogram-building approach
    • They differ only in how inter-cluster distances are calculated
    • All provide multi-level cluster views rather than a single partitioning
  3. Ward’s Linkage ↔︎ K-means: Notable cross-family similarity (see the sketch after this list)
    • Both minimize within-cluster variance
    • Ward’s can be viewed as a hierarchical version of the objective that k-means optimizes
    • Both tend to create spherical, similarly-sized clusters
  4. DBSCAN: Most distinct from others
    • Density-based rather than centroid-based or hierarchical
    • Conceptually different approach focusing on connected dense regions
    • Only algorithm that natively identifies noise points
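
A quick way to see the Ward’s ↔ k-means kinship from point 3 is to compare their partitions on the same data. The snippet below is an illustrative sketch, not from the original post; the toy blobs, seed, and choice of k = 4 are assumptions.

```python
# Ward linkage cut at k and k-means with the same k both minimize within-cluster
# variance, so on well-separated Gaussian blobs their partitions usually agree.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=1)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# ARI = 1.0 means identical partitions up to relabelling; values near 1 indicate
# the two methods found essentially the same clusters.
print("ARI(k-means, Ward):", round(adjusted_rand_score(km_labels, ward_labels), 3))
```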

Grouping by Core Characteristics

• Partitioning Methods: K-means, PAM
• Hierarchical Methods: Single, complete, average, and Ward’s linkages
• Density-Based Methods: DBSCAN
• Requires Pre-specified k: K-means, PAM
• Can Handle Arbitrary Shapes: Single linkage, DBSCAN
• Variance-Minimizing: K-means, Ward’s linkage
• Uses Actual Data Points as Representatives: PAM, hierarchical methods

The algorithms can be broadly categorized into three conceptual families (partitioning, hierarchical, and density-based), with the strongest similarities occurring within these families and certain cross-family similarities based on specific optimization objectives.

Optimal Cluster Number (k) Determination Methods

How to determine optimal k?

Summary of Criteria by Clustering Method

K-means
Criteria for optimal k:
• Elbow method (WCSS)
• Silhouette score
• Gap statistic
• Calinski-Harabasz index
• Davies-Bouldin index
• BIC/AIC
How to choose k: Select k where the elbow point occurs in the WCSS plot, the silhouette score is maximized, the gap statistic is maximized, the CH index is maximized, or the DB index is minimized.
Strengths:
• Elbow: simple and intuitive
• Silhouette: measures both cohesion and separation
• Gap: statistical foundation
Limitations:
• Elbow: subjective interpretation
• Multiple criteria often disagree
• Different criteria favor different cluster characteristics
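
The loop below sketches how several of the k-means criteria listed above can be computed with scikit-learn (WCSS via inertia_, silhouette, Calinski-Harabasz, Davies-Bouldin). The candidate range of k and the toy data are assumptions; the gap statistic and BIC/AIC are omitted because they are not built into scikit-learn's KMeans.

```python
# Scan candidate k values and report several internal validity indices for k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  WCSS={km.inertia_:8.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}  "
          f"CH={calinski_harabasz_score(X, km.labels_):8.1f}  "
          f"DB={davies_bouldin_score(X, km.labels_):.3f}")

# Choose k at the elbow of WCSS, or where silhouette/CH peak and DB is smallest;
# as noted above, the criteria will not always agree.
```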

PAM
Criteria for optimal k:
• Silhouette width
• Gap statistic
• Calinski-Harabasz index
• Davies-Bouldin index
• PAM-specific goodness-of-fit metrics
How to choose k: Same as for k-means, but using medoid-based rather than centroid-based calculations.
Strengths:
• More robust to outliers than the corresponding k-means criteria
• Works with arbitrary distance metrics
Limitations:
• Computationally more expensive
• Still subjective when criteria disagree
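
For PAM the same silhouette-based scan can be used with medoid clustering. The sketch below assumes the optional scikit-learn-extra package (its KMedoids estimator with method="pam"), which is not part of core scikit-learn; Manhattan distance is used purely as an example of a non-Euclidean metric.

```python
# Silhouette-based choice of k for PAM (k-medoids), using the same distance metric
# for clustering and validation.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # assumed extra dependency: scikit-learn-extra

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(2, 7):
    pam = KMedoids(n_clusters=k, metric="manhattan", method="pam", random_state=0).fit(X)
    score = silhouette_score(X, pam.labels_, metric="manhattan")
    print(f"k={k}  silhouette={score:.3f}")
```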

Hierarchical
Criteria for optimal k:
• Dendrogram inspection
• Inconsistency coefficient
• Cophenetic correlation
• Height difference analysis
• Standard indices after cutting
How to choose k:
• Cut the dendrogram where the largest vertical gap occurs
• Choose k where the inconsistency coefficient shows a significant jump
• Apply standard indices after cutting at different levels
Strengths:
• Visual validation via the dendrogram
• Provides a multi-level perspective
• Height differences show natural divisions
Limitations:
• Dendrogram interpretation is subjective
• Large datasets produce complex dendrograms
• Different linkage methods yield different results
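
The "largest vertical gap" rule can be automated from the linkage matrix. The SciPy-based sketch below is an illustration under assumed toy data and Ward linkage: it finds the biggest jump between successive merge heights and cuts the tree there.

```python
# Cut a dendrogram at the largest gap between successive merge heights.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

Z = linkage(X, method="ward")                  # (n-1) x 4 linkage matrix; column 2 = merge heights
heights = Z[:, 2]
gap_index = int(np.argmax(np.diff(heights)))   # biggest jump between consecutive merges
threshold = (heights[gap_index] + heights[gap_index + 1]) / 2
suggested_k = len(X) - gap_index - 1           # clusters remaining if we stop before that jump

labels = fcluster(Z, t=threshold, criterion="distance")
print(f"suggested k = {suggested_k}, clusters after cutting = {labels.max()}")
```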

DBSCAN
Criteria for optimal k:
• k-distance graph
• DBCV (Density-Based Clustering Validation)
• Silhouette on the resulting clusters
• Stability measures
How to choose k:
• Plot distances to the k-th nearest neighbor
• Find the “elbow” for the eps parameter
• Vary eps/minPts and evaluate the resulting cluster counts
Strengths:
• Doesn’t directly require specifying k
• Parameter selection is more intuitive for a density-based approach
Limitations:
• Focuses on eps/minPts rather than k directly
• Resulting cluster count is highly sensitive to parameters
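
Finally, a sketch of the k-distance heuristic for choosing eps. The value minPts = 5, the quantile used as a stand-in for reading the elbow off a plot, and the toy data are all assumptions for illustration; in practice the k-distance curve is inspected visually.

```python
# k-distance heuristic: sort each point's distance to its minPts-th nearest neighbour
# and look for the elbow; here a high quantile is used as a crude automatic stand-in.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_pts = 5

nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)   # +1: each point is its own nearest neighbour
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])                       # distance to the min_pts-th other point

eps = float(np.quantile(k_dist, 0.95))                   # in practice, read this off the k-distance plot
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps={eps:.3f}  clusters={n_clusters}  noise points={int(np.sum(labels == -1))}")
```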