K-Means Clustering is an unsupervised machine learning algorithm, which is used when we have unlabeled data (i.e., data without y or dependent variable).The goal of this algorithm is to find groups/ clusters/ segments in the data. The number of groups represented by the variable k. The K Means algorithm iteratively assign each data point to one of k groups based on the input features. The data points are clustered based on feature similarity.
A simple example of clustering is:
A bank wants to provide new credit card to its present customers. The bank may have millions of customers, out of which they have to choose the most potential customers that will use credit card. How can they proceed?
- Option 1: Check details of each customer separately and then make a decision. It will be a manual process that will take a huge amount of time and man power.
- Option 2: Use some rules (or logic) to filter customers potential customers. E.g. customers who have high monthly balance above certain threshold, or customers who use debit card frequently, etc. A drawback for this approach is that it will be difficult to create logics based on multiple factors. E.g. if we use high monthly balance threshold we might get retired or old customers that will not use credit card a lot. There can be other cases where we might miss potential customers.
- Option 3: Use clustering algorithm to create multiple groups (or clusters) of customers with similar characteristics. Now the bank can can make different strategies or offers for each group (or cluster) instead of creating different strategies for individual customers. E.g. bank can first provide credit cards to the cluster in which customers are young and salaried.
Now we got some idea regarding the use case of clustering. Let’s understand the K-means algorithm in detail.
Table of Contents
- Overview
- How K Means Algorithm works?
- Stopping Criteria for K-Means Clustering
- How to choose a k value?
- Advantages of K-means Algorithm
- Disadvantages of K-means Algorithm
- Video
- Python Code
Table of Contents
1. Overview
K-Means Clustering algorithm groups the unlabeled dataset into different clusters (groups or segments). k is the number of pre-defined clusters that will be created, e.g. if k=2, there will be two clusters, if k=3, there will be three clusters, and so on.
Each data point in a dataset belongs to only one group. All the data point in a group should have similar properties.
The below image is an working example of the K-means Clustering Algorithm:
2. How K-Means Algorithm Work?
The K-Means algorithm working is explained in the below steps or pseudocode:
- Load the data
- Choose the value of k to decide the number of clusters. k can be any integer.
- Select random k points or centroids.
- The algorithm works iteratively to assign each data point to one of k groups
- Find the Euclidean distance from centroids to all data points. The other distance calculation metrics are Manhattan, Hamming, cosine distance, etc.
- Assign each data point to their closest centroid, which will form the predefined k clusters.
- Re-compute the centroids of newly formed clusters.
- Repeat the step 4, (which means reassign each data point to the new closest centroid of each cluster), till a stopping criteria is reached.
- End !
Stopping Criteria for K-Means Clustering
There are generally two stopping criteria used to stop the iterations of K-means algorithm:
- Centroids of newly formed clusters do not change, which means data points will remain in the same cluster.
- Maximum number of iterations are reached
3. How to choose optimal k value?
We have in the previous section that the value of k needs to be chosen beforehand. The performance of the K-means clustering algorithm depends on the optimal and packed clusters. So, let’s see how to choose the optimal number of clusters using the techniques given below:
3.1 Elbow Method
The Elbow method is the go to method to find the optimal number of clusters. It uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, means the total variations within a cluster. In simple words, it is sum of the squared distances between each data point and its centroid and calculates the average distance within a cluster. To measure the distance between data points and centroid, we can use any method such as Euclidean distance, Manhattan distance or cosine distance, etc.
To find the optimal value of clusters, the elbow method follows the below steps:
- Perform the K-means clustering multiple times using various k values (from 1-10).
- For each value of k, calculates the WCSS value.
- Plots a curve between calculated WCSS values (sum of squared distance) and the number of clusters k.
- Look for sharp bend in the curve (looks like an arm or elbow), that point is considered as the optimal value of k.
The above plot shows the sharp bend, which looks like an elbow. Therefore, it is called the elbow method.
3.2 Silhouette Analysis
Silhouette analysis is also a popular technique used to find optimal value of k. This method calculates the degree of separation between clusters.
4. Advantages of K-means Algorithm
- Simple algorithm to understand and implement.
- Suitable to implement even if the dataset is large
5. Disadvantages of K-means Algorithm
- Selecting optimal k can be tricky.
- Being a distance based algorithm
- Very sensitive to irrelevant features.
- Scaling of data is necessary, either standardization or normalization.
- K means algorithm is good in capturing clusters in spherical shaped data. Because we use centroid to calculate nearest points to assign to a cluster, it always try to construct a spherical shape around the centroid. So in case of complicated geometric shapes, k-means does not perform well.
- Let’s see an example: As observed in the image below there are 2 line clusters, however k-means clustering gives different results in this case.
6. Video
7. Python Code
1. Import the libraries
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
2. Next, load the data set
dataset, _ = make_blobs(n_samples=200,n_features=2,
centers=4, cluster_std=0.5, random_state=0)
df = pd.DataFrame(dataset,columns=['var1', 'var2'])
df.head()
3. Perform Feature Scaling
scaler = StandardScaler()
scaler.fit(df)
X = scaler.transform(df)
4. Find the optimal number of k value using Elbow Plot
sse = {}
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=30)
kmeans.fit(X)
sse[k] = kmeans.inertia_ # SSE to closest cluster centroid
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
Using Elbow Plot, we found the bend or so optimal value of k is at k= 4
5. Let’s train the K-means model using k=4
kmeans = KMeans(n_clusters=4, random_state=30)
y_pred = kmeans.fit(X)
# Obtain all clusters which are unique
df["Cluster"] = kmeans.labels_
6. In the end, plot the results with 4 clusters in different colors and their centroid in black color.
plt.scatter(X[:, 0], X[:, 1], c=df.Cluster, s=30, cmap='tab20b')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=100, label = 'Centroid')
plt.title('K Means Clustering - KeytoDataScience')
plt.xlabel('Var1')
plt.ylabel('Var2')
# plt.legend()
plt.show()
Learn more Data Science Algorithms here