A Refresher on Clustering Analysis – Part II

Author

Zhenglei Gao

Published

March 12, 2025

Clustering Algorithms: Pros and Cons Comparison

K-means
Pros:
• Fast and efficient (runtime roughly linear in the number of observations)
• Easy to implement
• Scales well to large datasets
Cons:
• Requires pre-specifying k
• Sensitive to outliers
• Limited to spherical clusters
• Euclidean distance only
• Results depend on initial centroid placement
Best use cases: Datasets with well-separated, globular clusters of similar size

PAM
Pros:
• More robust to outliers than k-means
• Works with any distance metric
• Better for categorical/mixed data
Cons:
• Computationally expensive (O(k(n-k)²) per iteration)
• Requires pre-specifying k
• Slower than k-means
Best use cases: Datasets with outliers where medoids are preferred over means

Hierarchical - Single Linkage
Pros:
• Detects elongated clusters
• No need to specify k beforehand
• Provides a cluster hierarchy
Cons:
• Chaining effect
• Poor with noise
• O(n²) space complexity
Best use cases: Detecting non-elliptical clusters, finding outliers

Hierarchical - Complete Linkage
Pros:
• Creates compact clusters
• Less susceptible to noise than single linkage
Cons:
• Tends to break large clusters
• Sensitive to outliers
• O(n²) space complexity
Best use cases: When compact, evenly sized clusters are desired

Hierarchical - Average Linkage
Pros:
• A compromise between single and complete linkage
• Less sensitive to outliers than single or complete linkage
Cons:
• Computationally intensive
• Still O(n²) complexity
Best use cases: General-purpose hierarchical clustering with moderate outliers

Hierarchical - Ward’s
Pros:
• Minimizes within-cluster variance
• Creates similar-sized clusters
Cons:
• Sensitive to outliers
• Tends toward spherical clusters
Best use cases: When minimum-variance clusters are important

DBSCAN
Pros:
• No need to specify the number of clusters
• Finds arbitrarily shaped clusters
• Identifies outliers as noise
• Handles varying cluster sizes
Cons:
• Sensitive to parameters (eps, minPts)
• Struggles with varying densities
• Difficulty with high dimensions
Best use cases: Spatial data, datasets with noise and non-globular clusters
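
As a rough illustration of the shape-related entries above, the following sketch (my own illustration, not part of the original post) fits k-means, single-linkage hierarchical clustering, and DBSCAN to the same two-moons toy dataset. It assumes scikit-learn is installed; the dataset, seed, and parameter values are arbitrary choices for demonstration only.

```python
# Minimal sketch: the same non-globular dataset under three of the algorithms above.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

results = {
    "k-means (spherical bias)": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "single linkage (elongated shapes)": AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X),
    "DBSCAN (arbitrary shapes, noise = -1)": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
}

for name, labels in results.items():
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore DBSCAN's noise label
    print(f"{name}: {n_clusters} clusters, {int(np.sum(labels == -1))} noise points")
```

On data like this, k-means typically cuts each crescent roughly in half, while single linkage and DBSCAN tend to recover the two moons.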

Critical Differences:

  1. Specifying clusters: K-means and PAM require pre-defined k; hierarchical methods and DBSCAN do not
  2. Outlier handling: DBSCAN explicitly identifies outliers; PAM is moderately robust; k-means is highly sensitive (see the sketch after this list)
  3. Cluster shapes: K-means/PAM find spherical clusters; hierarchical varies by linkage; DBSCAN finds arbitrary shapes
  4. Scalability: K-means scales best to large datasets; hierarchical methods and PAM are more computationally intensive
  5. Distance metrics: K-means limited to Euclidean; others can use various distance measures
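
To make difference #2 concrete, here is a small sketch (again my own illustration, not from the post) with two Gaussian blobs plus one extreme outlier; the data, seed, and parameters are assumptions chosen only to show the contrast.

```python
# k-means must assign every point, so an extreme outlier can claim or drag a centroid;
# DBSCAN instead labels it as noise (-1).
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),  # blob 1
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),  # blob 2
    [[50.0, 50.0]],                                        # one extreme outlier
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=1.0, min_samples=5).fit(X)

print("k-means cluster sizes:", dict(Counter(km.labels_.tolist())))  # outlier distorts the partition
print("DBSCAN label counts  :", dict(Counter(db.labels_.tolist())))  # -1 = points flagged as noise
```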

Clustering Algorithm Similarities: Relationship Map

Primary Algorithm Relationships

  1. K-means ↔︎ PAM: Most closely related pair
    • Both are partitioning algorithms requiring pre-specified k
    • Both iteratively optimize cluster representatives
    • PAM is essentially a robust variant of k-means using medoids instead of centroids
    • Both create flat, non-hierarchical clusters of similar sizes
  2. Hierarchical Variants: Form their own family
    • All hierarchical methods (single, complete, average, Ward’s) share the same dendrogram-building approach
    • They differ only in how inter-cluster distances are calculated
    • All provide multi-level cluster views rather than a single partitioning
  3. Ward’s Linkage ↔︎ K-means: Notable cross-family similarity (see the sketch after this list)
    • Both minimize within-cluster variance
    • Ward’s can be viewed as a hierarchical version of the objective that k-means optimizes
    • Both tend to create spherical, similarly-sized clusters
  4. DBSCAN: Most distinct from others
    • Density-based rather than centroid-based or hierarchical
    • Conceptually different approach focusing on connected dense regions
    • Only algorithm that natively identifies noise points
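
A quick way to see the Ward’s ↔ k-means kinship from point 3 is to compare their partitions on the same data. The snippet below is an illustrative sketch, not from the original post; the toy blobs, seed, and choice of k = 4 are assumptions.

```python
# Ward linkage cut at k and k-means with the same k both minimize within-cluster
# variance, so on well-separated Gaussian blobs their partitions usually agree.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=1)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# ARI = 1.0 means identical partitions up to relabelling; values near 1 indicate
# the two methods found essentially the same clusters.
print("ARI(k-means, Ward):", round(adjusted_rand_score(km_labels, ward_labels), 3))
```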

Grouping by Core Characteristics

• Partitioning Methods: K-means, PAM
• Hierarchical Methods: Single, complete, average, and Ward’s linkages
• Density-Based Methods: DBSCAN
• Requires Pre-specified k: K-means, PAM
• Can Handle Arbitrary Shapes: Single linkage, DBSCAN
• Variance-Minimizing: K-means, Ward’s linkage
• Uses Actual Data Points as Representatives: PAM, hierarchical methods

The algorithms can be broadly categorized into three conceptual families (partitioning, hierarchical, and density-based), with the strongest similarities occurring within these families and certain cross-family similarities based on specific optimization objectives.

Optimal Cluster Number (k) Determination Methods

How to determine optimal k?

Summary of Criteria by Clustering Method

K-means
Criteria for optimal k:
• Elbow method (WCSS)
• Silhouette score
• Gap statistic
• Calinski-Harabasz index
• Davies-Bouldin index
• BIC/AIC
How to choose k: Select k where the elbow point occurs in the WCSS plot, the silhouette score is maximized, the gap statistic is maximized, the CH index is maximized, or the DB index is minimized.
Strengths:
• Elbow: simple and intuitive
• Silhouette: measures both cohesion and separation
• Gap: statistical foundation
Limitations:
• Elbow: subjective interpretation
• Multiple criteria often disagree
• Different criteria favor different cluster characteristics
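
The loop below sketches how several of the k-means criteria listed above can be computed with scikit-learn (WCSS via inertia_, silhouette, Calinski-Harabasz, Davies-Bouldin). The candidate range of k and the toy data are assumptions; the gap statistic and BIC/AIC are omitted because they are not built into scikit-learn's KMeans.

```python
# Scan candidate k values and report several internal validity indices for k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  WCSS={km.inertia_:8.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}  "
          f"CH={calinski_harabasz_score(X, km.labels_):8.1f}  "
          f"DB={davies_bouldin_score(X, km.labels_):.3f}")

# Choose k at the elbow of WCSS, or where silhouette/CH peak and DB is smallest;
# as noted above, the criteria will not always agree.
```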

PAM
Criteria for optimal k:
• Silhouette width
• Gap statistic
• Calinski-Harabasz index
• Davies-Bouldin index
• PAM-specific goodness-of-fit metrics
How to choose k: Same as for k-means, but using medoid-based rather than centroid-based calculations.
Strengths:
• More robust to outliers than the corresponding k-means criteria
• Works with arbitrary distance metrics
Limitations:
• Computationally more expensive
• Still subjective when criteria disagree
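
For PAM the same silhouette-based scan can be used with medoid clustering. The sketch below assumes the optional scikit-learn-extra package (its KMedoids estimator with method="pam"), which is not part of core scikit-learn; Manhattan distance is used purely as an example of a non-Euclidean metric.

```python
# Silhouette-based choice of k for PAM (k-medoids), using the same distance metric
# for clustering and validation.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # assumed extra dependency: scikit-learn-extra

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(2, 7):
    pam = KMedoids(n_clusters=k, metric="manhattan", method="pam", random_state=0).fit(X)
    score = silhouette_score(X, pam.labels_, metric="manhattan")
    print(f"k={k}  silhouette={score:.3f}")
```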

Hierarchical
Criteria for optimal k:
• Dendrogram inspection
• Inconsistency coefficient
• Cophenetic correlation
• Height difference analysis
• Standard indices after cutting
How to choose k:
• Cut the dendrogram where the largest vertical gap occurs
• Choose k where the inconsistency coefficient shows a significant jump
• Apply standard indices after cutting at different levels
Strengths:
• Visual validation via the dendrogram
• Provides a multi-level perspective
• Height differences show natural divisions
Limitations:
• Dendrogram interpretation is subjective
• Large datasets produce complex dendrograms
• Different linkage methods yield different results
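
The "largest vertical gap" rule can be automated from the linkage matrix. The SciPy-based sketch below is an illustration under assumed toy data and Ward linkage: it finds the biggest jump between successive merge heights and cuts the tree there.

```python
# Cut a dendrogram at the largest gap between successive merge heights.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

Z = linkage(X, method="ward")                  # (n-1) x 4 linkage matrix; column 2 = merge heights
heights = Z[:, 2]
gap_index = int(np.argmax(np.diff(heights)))   # biggest jump between consecutive merges
threshold = (heights[gap_index] + heights[gap_index + 1]) / 2
suggested_k = len(X) - gap_index - 1           # clusters remaining if we stop before that jump

labels = fcluster(Z, t=threshold, criterion="distance")
print(f"suggested k = {suggested_k}, clusters after cutting = {labels.max()}")
```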

DBSCAN
Criteria for optimal k:
• k-distance graph
• DBCV (Density-Based Clustering Validation)
• Silhouette on the resulting clusters
• Stability measures
How to choose k:
• Plot distances to the k-th nearest neighbor
• Find the “elbow” for the eps parameter
• Vary eps/minPts and evaluate the resulting cluster counts
Strengths:
• Doesn’t directly require specifying k
• Parameter selection is more intuitive for a density-based approach
Limitations:
• Focuses on eps/minPts rather than k directly
• Resulting cluster count is highly sensitive to parameters
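
Finally, a sketch of the k-distance heuristic for choosing eps. The value minPts = 5, the quantile used as a stand-in for reading the elbow off a plot, and the toy data are all assumptions for illustration; in practice the k-distance curve is inspected visually.

```python
# k-distance heuristic: sort each point's distance to its minPts-th nearest neighbour
# and look for the elbow; here a high quantile is used as a crude automatic stand-in.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_pts = 5

nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)   # +1: each point is its own nearest neighbour
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])                       # distance to the min_pts-th other point

eps = float(np.quantile(k_dist, 0.95))                   # in practice, read this off the k-distance plot
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps={eps:.3f}  clusters={n_clusters}  noise points={int(np.sum(labels == -1))}")
```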