Status: Open
Labels: enhancement (New feature or request)
Description
📊 Feature Request: compare_silhouette_scores Method for Cluster Number Evaluation
Summary
Add a new method compare_silhouette_scores to the unsupervised module of the DataScienceUtils package. This method will help users determine the optimal number of clusters by running clustering over a specified range of cluster counts and plotting the corresponding average silhouette scores.
Motivation
Selecting the right number of clusters (k) is a critical step in clustering analysis. The silhouette score is a popular metric for assessing cluster cohesion and separation. Automating the process of:
- Running clustering algorithms across multiple k values,
- Computing silhouette scores for each k,
- Visualizing the silhouette score trend,
greatly simplifies hyperparameter tuning and improves the data scientist's workflow.
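Done by hand today, those steps amount to a short loop. A minimal sketch of the manual workflow the proposed method would automate, using scikit-learn's `KMeans` and `make_blobs` for toy data (names and parameters here are illustrative, not part of the proposal):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: four well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Run clustering for each candidate k and collect the average silhouette score.
k_values = list(range(2, 11))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# The k with the highest average silhouette score is the usual pick.
best_k = k_values[int(np.argmax(scores))]
```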
API Proposal and Full Code
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import Any, Optional, Tuple, Union

from sklearn.metrics import silhouette_score


def compare_silhouette_scores(
    X: Union[np.ndarray, 'pd.DataFrame'],
    k_range: range = range(2, 11),
    algorithm: Any = None,
    algorithm_params: Optional[dict] = None,
    figsize: Tuple[int, int] = (10, 6),
) -> plt.Figure:
    """
    Compare silhouette scores across different numbers of clusters.

    Parameters:
        X (array-like or pd.DataFrame): Input data for clustering.
        k_range (range): Range of cluster numbers to evaluate (default: 2 to 10).
        algorithm (class): Clustering algorithm class (must implement fit_predict
            and accept an n_clusters argument). Defaults to KMeans.
        algorithm_params (dict): Parameters for clustering algorithm initialization.
        figsize (Tuple[int, int]): Size of the matplotlib figure.

    Returns:
        plt.Figure: Matplotlib Figure object showing silhouette scores vs
            number of clusters.
    """
    from sklearn.cluster import KMeans

    if algorithm is None:
        algorithm = KMeans
    if algorithm_params is None:
        algorithm_params = {'random_state': 42, 'n_init': 10}

    silhouette_scores = []
    k_values = list(k_range)
    for k in k_values:
        clusterer = algorithm(n_clusters=k, **algorithm_params)
        cluster_labels = clusterer.fit_predict(X)
        silhouette_scores.append(silhouette_score(X, cluster_labels))

    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(k_values, silhouette_scores, 'bo-', linewidth=2, markersize=8)
    ax.set_xlabel('Number of Clusters (k)', fontsize=12)
    ax.set_ylabel('Average Silhouette Score', fontsize=12)
    ax.set_title('Silhouette Score vs Number of Clusters', fontsize=14)
    ax.grid(True, alpha=0.3)

    # Highlight the best-scoring k.
    best_k = k_values[int(np.argmax(silhouette_scores))]
    best_score = max(silhouette_scores)
    ax.axvline(x=best_k, color='red', linestyle='--', alpha=0.7)
    ax.text(best_k, best_score + 0.02, f'Best: k={best_k}\nScore={best_score:.3f}',
            ha='center', va='bottom', fontsize=10,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7))

    # Annotate each point with its score.
    for k, score in zip(k_values, silhouette_scores):
        ax.text(k, score - 0.02, f'{score:.3f}', ha='center', va='top', fontsize=9)

    plt.tight_layout()
    return fig
```
To Do
- Implement compare_silhouette_scores as described above in unsupervised.py.
- Write unit tests to verify:
  * Correct calculation of silhouette scores across k values.
  * Handling of default and custom clustering algorithms.
- Update documentation with usage examples.
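A unit-test sketch for the second bullet. Since compare_silhouette_scores is not implemented yet, the sketch exercises the behaviours the real tests would assert on, via a hypothetical helper (`scores_over_k`) that mirrors the proposed method's core loop; `AgglomerativeClustering` stands in as the "custom" algorithm because it exposes `fit_predict` and an `n_clusters` argument:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def scores_over_k(X, algorithm, k_range, **params):
    # Hypothetical helper mirroring the proposed method's core loop.
    return [silhouette_score(X, algorithm(n_clusters=k, **params).fit_predict(X))
            for k in k_range]


def test_default_algorithm():
    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
    scores = scores_over_k(X, KMeans, range(2, 6), random_state=42, n_init=10)
    assert len(scores) == 4
    assert all(-1.0 <= s <= 1.0 for s in scores)


def test_custom_algorithm():
    # Any class with fit_predict and an n_clusters kwarg qualifies.
    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
    scores = scores_over_k(X, AgglomerativeClustering, range(2, 6))
    assert len(scores) == 4


test_default_algorithm()
test_custom_algorithm()
```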