Model Evaluation for Clustering Algorithms

This article covers the following model evaluation metrics and techniques for clustering algorithms.

1. Clustering Tendency

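The Hopkins test, a statistical test for the spatial randomness of a variable, can be used to assess clustering tendency before any clustering is run. It measures how likely it is that the data points were generated by a uniform distribution; if they were, there is no meaningful cluster structure to find.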
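To make the idea concrete, here is a minimal pure-Python sketch of one common form of the Hopkins statistic. The function name `hopkins_statistic` and its parameters are hypothetical, chosen for illustration. It assumes Euclidean distance and draws the uniform reference points from the bounding box of the data. Conventions differ between sources (some report 1 - H), so treat the interpretation below as an assumption of this sketch: values near 0.5 suggest uniformly random data, values near 1 suggest clustered data.

```python
import math
import random


def hopkins_statistic(points, sample_size=None, seed=0):
    """Hopkins statistic H = sum(u^d) / (sum(u^d) + sum(w^d)).

    points: list of equal-length numeric tuples (hypothetical input format).
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])
    m = sample_size or max(1, n // 10)

    # Bounding box of the data, used to draw the uniform reference points.
    lows = [min(p[j] for p in points) for j in range(dim)]
    highs = [max(p[j] for p in points) for j in range(dim)]

    # w: distance from each sampled real point to its nearest *other* real point.
    sampled_idx = rng.sample(range(n), m)
    w = [
        min(math.dist(points[i], points[k]) for k in range(n) if k != i)
        for i in sampled_idx
    ]

    # u: distance from each uniform random point to its nearest real point.
    u = []
    for _ in range(m):
        q = [rng.uniform(lows[j], highs[j]) for j in range(dim)]
        u.append(min(math.dist(q, p) for p in points))

    u_sum = sum(d ** dim for d in u)
    w_sum = sum(d ** dim for d in w)
    return u_sum / (u_sum + w_sum)
```

The brute-force nearest-neighbour search keeps the sketch dependency-free but scales quadratically, so a spatial index would be a sensible swap for larger data sets.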

2. Number of Clusters (k)

There is no definitive answer for finding the right number of clusters, as it depends upon:

  • (a) the shape of the distribution
  • (b) the scale of the data set
  • (c) the clustering resolution required by the user.

Although finding the number of clusters is a very subjective problem, there are two major approaches to finding the optimal number of clusters:

2.1 Domain Knowledge

Domain knowledge plays an important role in creating and selecting clusters or segments. General guidelines to follow while building store segments: (a) each segment should contain more than 5% of the data, and (b) the total number of segments or clusters should not be more than 5 to 7.
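These two guidelines are easy to check mechanically once cluster labels exist. The sketch below is only an illustration: the helper name `check_segment_guidelines` and its defaults (a 5% minimum share and an upper limit of 7 clusters) are assumptions taken from the guidelines above, not a standard API.

```python
from collections import Counter


def check_segment_guidelines(labels, min_share=0.05, max_clusters=7):
    """Report which segments break the size and count guidelines."""
    counts = Counter(labels)
    total = len(labels)
    # Fraction of all data points that falls into each segment.
    shares = {cluster: n / total for cluster, n in counts.items()}
    # A segment must hold MORE than min_share of the data to pass.
    too_small = sorted(c for c, share in shares.items() if share <= min_share)
    return {
        "n_clusters": len(counts),
        "too_many_clusters": len(counts) > max_clusters,
        "too_small": too_small,
        "shares": shares,
    }
```

For example, calling it on labels where one segment holds only 3% of the points would list that segment under `too_small`, which is a hint to merge it with a neighbour or reduce k.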

2.2 Data-Driven Approach

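  • Empirical Method: A simple rule of thumb sets the number of clusters to k = sqrt(N/2), where N is the total number of data points, so that each cluster contains roughly N/k = sqrt(2 * N) points.
  • Elbow Method: The variance within a cluster is a measure of the compactness of the cluster: the lower the within-cluster variance, the more compact the clusters. Because the total within-cluster variance always decreases as k grows, it is plotted against k, and the chosen k is the "elbow" of the curve, the point after which adding another cluster yields only a small further reduction.
  • Statistical Approach: A statistical method named the gap statistic can be used to find the optimal number of clusters, k. It compares the within-cluster dispersion of the data for each k with the dispersion expected under a reference distribution that has no cluster structure, and favors the k for which the gap between the two is largest.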
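The three data-driven approaches above can be sketched with scikit-learn as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, k-means is assumed as the clustering algorithm, its `inertia_` attribute (within-cluster sum of squared distances) stands in for within-cluster variance, and the gap statistic is simplified to a uniform reference drawn from the bounding box of the data, without the standard-error rule that the full method uses to pick k.

```python
import math

import numpy as np
from sklearn.cluster import KMeans


def rule_of_thumb_k(n_points):
    """Empirical rule: k is roughly sqrt(N / 2)."""
    return max(1, round(math.sqrt(n_points / 2)))


def inertia_curve(X, k_values, seed=0):
    """Within-cluster sum of squares for each k; plot it to look for the elbow."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


def gap_statistic(X, k_values, n_refs=10, seed=0):
    """Simplified gap: mean log-inertia of uniform references minus log-inertia of X."""
    rng = np.random.default_rng(seed)
    lows, highs = X.min(axis=0), X.max(axis=0)
    log_w = {k: np.log(w) for k, w in inertia_curve(X, k_values, seed).items()}

    ref_logs = {k: [] for k in k_values}
    for _ in range(n_refs):
        # Reference data set with no cluster structure, same shape as X.
        ref = rng.uniform(lows, highs, size=X.shape)
        for k, w in inertia_curve(ref, k_values, seed).items():
            ref_logs[k].append(np.log(w))

    return {k: float(np.mean(ref_logs[k]) - log_w[k]) for k in k_values}
```

In practice one would plot the values returned by `inertia_curve` against k and read off the elbow by eye, then compare that choice with the k that maximizes the gap. The three methods do not always agree, which is one reason domain knowledge remains useful.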

3. Clustering Quality

Good clustering quality is usually characterized by a minimal intra-cluster distance (within a cluster) and a maximal inter-cluster distance (between clusters).

There are two types of measures for assessing clustering quality or performance:

  1. Extrinsic Measures: These require ground truth labels. Examples are the Adjusted Rand Index, the Fowlkes-Mallows score, mutual information-based scores, homogeneity, completeness and V-measure.
  2. Intrinsic Measures: These do not require ground truth labels. Examples are the Silhouette Coefficient, the Calinski-Harabasz Index and the Davies-Bouldin Index.
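All of the measures listed above are available in `sklearn.metrics`. The sketch below shows how they might be computed side by side. The helper names `extrinsic_scores` and `intrinsic_scores` are hypothetical, and the synthetic blobs and the k-means model are assumptions made only to have something to score.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def extrinsic_scores(labels_true, labels_pred):
    """Measures that compare predicted clusters with ground truth labels."""
    h, c, v = metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
    return {
        "adjusted_rand": metrics.adjusted_rand_score(labels_true, labels_pred),
        "fowlkes_mallows": metrics.fowlkes_mallows_score(labels_true, labels_pred),
        "adjusted_mutual_info": metrics.adjusted_mutual_info_score(labels_true, labels_pred),
        "homogeneity": h,
        "completeness": c,
        "v_measure": v,
    }


def intrinsic_scores(X, labels_pred):
    """Measures that use only the data and the predicted clusters."""
    return {
        "silhouette": metrics.silhouette_score(X, labels_pred),  # higher is better
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels_pred),  # higher is better
        "davies_bouldin": metrics.davies_bouldin_score(X, labels_pred),  # lower is better
    }


# Synthetic, well-separated data (an assumption for illustration only).
X, y_true = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=0.5, random_state=0
)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

ext = extrinsic_scores(y_true, y_pred)
intr = intrinsic_scores(X, y_pred)
```

Note the differing directions: a higher Silhouette Coefficient and Calinski-Harabasz Index indicate better-defined clusters, while a lower Davies-Bouldin Index does.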


© 2022 KeyToDataScience