Clustering Algorithms: Pros and Cons Comparison
| Algorithm | Pros | Cons | Best Use Case |
| --- | --- | --- | --- |
| K-means | • Fast and efficient (roughly linear in n) • Easy to implement • Scales well to large datasets | • Requires pre-specifying k • Sensitive to outliers • Limited to spherical clusters • Euclidean distance only • Results depend on initial centroids | Datasets with well-separated, globular clusters of similar size |
| PAM | • More robust to outliers than k-means • Works with any distance metric • Better for categorical/mixed data | • Computationally expensive (O(k(n-k)²) per iteration) • Requires pre-specifying k • Slower than k-means | Datasets with outliers, where medoids are preferred over means |
| Hierarchical - Single Linkage | • Detects elongated clusters • No need to specify k beforehand • Provides a cluster hierarchy | • Prone to the chaining effect • Performs poorly with noise • O(n²) space complexity | Detecting non-elliptical clusters, finding outliers |
| Hierarchical - Complete Linkage | • Creates compact clusters • Less susceptible to noise than single linkage | • Tends to break large clusters • Sensitive to outliers • O(n²) space complexity | When compact, evenly sized clusters are desired |
| Hierarchical - Average Linkage | • Compromise between single and complete linkage • More robust than other linkages | • Computationally intensive • Still O(n²) complexity | General-purpose hierarchical clustering with moderate outliers |
| Hierarchical - Ward's | • Minimizes within-cluster variance • Creates similar-sized clusters | • Sensitive to outliers • Tends toward spherical clusters | When minimum-variance clusters are important |
| DBSCAN | • No need to specify the number of clusters • Finds arbitrary shapes • Identifies outliers as noise • Handles varying cluster sizes | • Sensitive to parameters (eps, minPts) • Struggles with varying densities • Struggles in high dimensions | Spatial data, datasets with noise and non-globular clusters |
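As a rough illustration of the shape-handling differences in the table above, the sketch below (assuming scikit-learn; the dataset and all eps/minPts and k values are arbitrary example choices, not prescriptions) runs k-means, the four hierarchical linkages, and DBSCAN on two crescent-shaped clusters. PAM is omitted only because scikit-learn itself does not ship a PAM implementation (one exists in add-on packages such as scikit-learn-extra).

```python
# Illustrative sketch: how each algorithm handles non-globular clusters.
# Assumes scikit-learn; all parameter values are arbitrary examples.
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving crescents: well separated but decidedly non-spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "k-means":          KMeans(n_clusters=2, n_init=10, random_state=0),
    "single linkage":   AgglomerativeClustering(n_clusters=2, linkage="single"),
    "complete linkage": AgglomerativeClustering(n_clusters=2, linkage="complete"),
    "average linkage":  AgglomerativeClustering(n_clusters=2, linkage="average"),
    "Ward's linkage":   AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "DBSCAN":           DBSCAN(eps=0.25, min_samples=5),  # eps/minPts picked by eye for this toy data
}

for name, model in models.items():
    labels = model.fit_predict(X)
    # Agreement with the true crescents (1.0 = perfect); single linkage and DBSCAN
    # typically come out on top here, while the variance-minimizing methods do not.
    print(f"{name:17s} ARI = {adjusted_rand_score(y_true, labels):.2f}")
```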
Critical Differences:
- Specifying the number of clusters: K-means and PAM require a pre-defined k; hierarchical methods and DBSCAN do not
- Outlier handling: DBSCAN explicitly identifies outliers; PAM is moderately robust; k-means is highly sensitive
- Cluster shapes: K-means/PAM find spherical clusters; hierarchical varies by linkage; DBSCAN finds arbitrary shapes
- Scalability: K-means scales best to large datasets; hierarchical methods and PAM are more computationally intensive
- Distance metrics: K-means limited to Euclidean; others can use various distance measures
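To illustrate the last point, here is a minimal sketch (assuming SciPy and scikit-learn; the dataset, metric, and k are arbitrary example choices) of hierarchical clustering run on Manhattan distances, something k-means cannot do because it is tied to Euclidean distance.

```python
# Minimal sketch: average-linkage hierarchical clustering on Manhattan distances,
# which k-means cannot use. Assumes SciPy and scikit-learn; dataset and k are examples.
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

d = pdist(X, metric="cityblock")                  # condensed pairwise Manhattan distance matrix
Z = linkage(d, method="average")                  # dendrogram built from those distances
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 flat clusters (labels 1..3)
print("cluster sizes:", [int((labels == c).sum()) for c in (1, 2, 3)])
```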
Clustering Algorithm Similarities: Relationship Map
Primary Algorithm Relationships
- K-means ↔︎ PAM: Most closely related pair
  - Both are partitioning algorithms requiring a pre-specified k
  - Both iteratively optimize cluster representatives
  - PAM is essentially a robust variant of k-means, using medoids instead of centroids
  - Both create flat, non-hierarchical clusters of similar sizes
- Hierarchical Variants: Form their own family
  - All hierarchical methods (single, complete, average, Ward's) share the same dendrogram-building approach
  - They differ only in how inter-cluster distances are calculated
  - All provide multi-level cluster views rather than a single partitioning
- Ward's Linkage ↔︎ K-means: Notable cross-family similarity
  - Both minimize within-cluster variance (checked empirically in the sketch after this list)
  - Ward's can be viewed as a hierarchical version of the objective that k-means optimizes
  - Both tend to create spherical, similarly sized clusters
- DBSCAN: Most distinct from the others
  - Density-based rather than centroid-based or hierarchical
  - Conceptually different approach, focusing on connected dense regions
  - The only algorithm that natively identifies noise points
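The Ward's ↔︎ k-means relationship can be checked empirically. The sketch below (assuming scikit-learn; the blob dataset and k=4 are arbitrary example choices) compares the two partitions with the adjusted Rand index; on well-separated globular data they usually agree almost perfectly.

```python
# Illustrative sketch: Ward's linkage and k-means often produce near-identical
# partitions on globular data, since both minimize within-cluster variance.
# Assumes scikit-learn; dataset and k are arbitrary examples.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

km_labels   = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# ARI = 1.0 means identical partitions up to label renaming; on well-separated
# blobs the two methods usually agree almost perfectly.
print("ARI(k-means, Ward's):", round(adjusted_rand_score(km_labels, ward_labels), 3))
```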
Grouping by Core Characteristics
| Characteristic | Algorithms |
| --- | --- |
| Partitioning Methods | K-means, PAM |
| Hierarchical Methods | Single, Complete, Average, Ward's linkages |
| Density-Based Methods | DBSCAN |
| Requires Pre-specified k | K-means, PAM |
| Can Handle Arbitrary Shapes | Single linkage, DBSCAN |
| Variance-Minimizing | K-means, Ward's linkage |
| Uses Actual Data Points as Representatives | PAM, Hierarchical methods |
The algorithms can be broadly categorized into three conceptual families (partitioning, hierarchical, and density-based), with the strongest similarities occurring within these families and certain cross-family similarities based on specific optimization objectives.
Optimal Cluster Number (k) Determination Methods
How to determine optimal k?
Summary Table of Criteria by Clustering Method
| Method | Criteria / Indices | How to Choose | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| K-means | • Elbow method (WCSS) • Silhouette score • Gap statistic • Calinski-Harabasz index • Davies-Bouldin index • BIC/AIC | Select k where: • the elbow occurs in the WCSS plot • the silhouette score is maximized • the gap statistic is maximized • the CH index is maximized • the DB index is minimized | • Elbow: simple, intuitive • Silhouette: measures both cohesion and separation • Gap: statistical foundation | • Elbow: subjective interpretation • Multiple criteria often disagree • Different criteria favor different cluster characteristics |
| PAM | • Silhouette width • Gap statistic • Calinski-Harabasz index • Davies-Bouldin index • PAM-specific goodness-of-fit metrics | Same as k-means, but using medoid-based rather than centroid-based calculations | • More robust to outliers than k-means metrics • Works with arbitrary distance metrics | • Computationally more expensive • Still subjective when criteria disagree |
| Hierarchical | • Dendrogram inspection • Inconsistency coefficient • Cophenetic correlation • Height-difference analysis • Standard indices after cutting | • Cut the dendrogram where the largest vertical gap occurs • Choose k where the inconsistency coefficient shows a significant jump • Apply standard indices after cutting at different levels | • Visual validation via the dendrogram • Provides a multi-level perspective • Height differences show natural divisions | • Dendrogram interpretation is subjective • Large datasets produce complex dendrograms • Different linkage methods yield different results |
| DBSCAN | • k-distance graph • DBCV (Density-Based Clustering Validation) • Silhouette on the resulting clusters • Stability measures | • Plot each point's distance to its kth nearest neighbor • Find the "elbow" to set the eps parameter • Vary eps/minPts and evaluate the resulting cluster counts | • Does not directly require specifying k • Parameter selection is more intuitive for a density-based approach | • Focuses on eps/minPts rather than k directly • Resulting cluster count is highly sensitive to parameters |
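Two of the criteria in the table above can be sketched in a few lines, assuming scikit-learn (the dataset and scan ranges are arbitrary example choices): a WCSS/silhouette scan over k for k-means, and the k-distance curve used to pick eps for DBSCAN.

```python
# Illustrative sketch of two criteria from the table: a WCSS/silhouette scan over k
# for k-means, and the k-distance curve used to pick eps for DBSCAN.
# Assumes scikit-learn; dataset and ranges are arbitrary examples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)

# 1) k-means: look for the elbow in WCSS and the maximum of the silhouette score.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: WCSS={km.inertia_:9.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")

# 2) DBSCAN: sort each point's distance to its minPts-th nearest neighbour (the point
#    itself counts as the first here, which is close enough for the heuristic) and
#    look for the knee of the curve to choose eps.
min_pts = 5
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])
print("k-distance curve (plot it and look for the knee):", k_dist[::80].round(2))
```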
When Criteria Disagree: Recommended Approach
- Primary Reliance (in order):
  - Domain knowledge and business context
  - Silhouette score (balances cohesion and separation)
  - Stability-based measures (consistent results across samples)
  - Visual validation when possible
- Ensemble Approach:
  - Calculate multiple indices and look for consensus
  - Weight criteria based on clustering goals (separation vs. balance)
  - Use majority voting among the top-performing criteria
- Method-Specific Recommendations:
  - K-means: silhouette + gap statistic, with visual elbow confirmation
  - PAM: average silhouette width, especially with non-Euclidean distances
  - Hierarchical: dendrogram inspection + inconsistency coefficient
  - DBSCAN: k-distance plot for eps + validation on the resulting clusters
- Practical Implementation (sketched in code after this list):
  - Run the algorithms with k ranging from 2 to \(\sqrt{n/2}\)
  - Plot all criteria on a standardized scale
  - Look for agreement points or stable regions
  - Consider the purpose of the clustering in the final decision
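The practical recipe above might look roughly like the following sketch, assuming scikit-learn (the dataset, the choice of indices, and the simple min-max rescaling and averaging are example choices, not a standard): scan k from 2 to \(\sqrt{n/2}\), compute several indices, put them on a common scale, and look for a k where they agree.

```python
# Illustrative sketch of the recipe above: scan k from 2 to sqrt(n/2), compute several
# validity indices, rescale each to [0, 1], and look for a k where they agree.
# Assumes scikit-learn; the indices, scaling, and averaging are simple example choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)
k_max = int(np.sqrt(len(X) / 2))        # upper bound from the rule of thumb above
ks = list(range(2, k_max + 1))

scores = {"silhouette": [], "calinski_harabasz": [], "davies_bouldin": []}
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores["silhouette"].append(silhouette_score(X, labels))
    scores["calinski_harabasz"].append(calinski_harabasz_score(X, labels))
    scores["davies_bouldin"].append(davies_bouldin_score(X, labels))   # lower is better

def rescale(values, higher_is_better=True):
    """Min-max scale to [0, 1], flipping criteria where lower values are better."""
    v = np.asarray(values, dtype=float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)
    return v if higher_is_better else 1.0 - v

consensus = (rescale(scores["silhouette"])
             + rescale(scores["calinski_harabasz"])
             + rescale(scores["davies_bouldin"], higher_is_better=False)) / 3

for k, c in zip(ks, consensus):
    print(f"k={k:2d}: consensus score = {c:.2f}")
print("consensus pick: k =", ks[int(np.argmax(consensus))])
```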
Remember that the "best" k often depends on the specific goals of your clustering task: whether you need fine-grained clusters, balanced sizes, or maximum separation between groups.