The k-nearest neighbors (KNN) algorithm stands out among machine learning techniques for its simplicity and intuitiveness. KNN is a supervised machine learning algorithm that can be used to solve both classification and regression problems. KNN is also a lazy learning and non-parametric algorithm. That is a lot of big terms! Let’s unpack and understand each one in detail.
Table of Contents
- Overview
- How KNN works?
- How to choose a k value?
- Types of Distance Metrics
- Advantages of KNN Algorithm
- Disadvantages of KNN Algorithm
- Python Code
1. Overview
The k-nearest neighbors (KNN) algorithm classifies data by estimating the likelihood that a new data point belongs to a group, based on the group membership of the data points nearest to it.
KNN is a lazy learning and non-parametric algorithm.
- Lazy learning algorithm: KNN is a lazy learning algorithm (or lazy learner) because it doesn’t perform any training up front; it simply stores the data during the training phase. Only at prediction time does KNN use the stored training data and perform its calculations.
- Non-parametric learning algorithm: KNN is considered a non-parametric method because it doesn’t make any assumptions about the underlying data distribution. By contrast, an algorithm such as linear regression requires several assumptions to hold before the model is trained.
How KNN works? – Simple Explanation
To determine whether a new data point belongs to Category A or Category B, the algorithm looks at the data points near it (its neighbors). The neighbors then take a vote, and the majority category is assigned to the new data point. For example, if most of the neighboring points belong to Category A, it is very likely that the new data point also belongs to Category A, and vice versa.
So, as the name suggests, k-nearest neighbors (KNN) classifies a new data point based on the nearest annotated data points, i.e. its nearest neighbors.
NOTE: KNN classification and K-means clustering are different algorithms. KNN is a supervised classification algorithm that classifies new data points based on the nearest data points (neighbors), whereas K-means clustering is an unsupervised algorithm that groups data into K clusters or segments.
2. How KNN works?
K-Nearest Neighbor algorithm pseudocode
The working of k-NN can be explained with the following algorithm (pseudocode):
- Load the data.
- Choose the value of k (i.e. the number of nearest neighbors to consider). k can be any positive integer.
- For each test data point in the test data, predict its class as follows:
- Find the Euclidean distance from the test point to all training data samples. Other distance metrics such as Manhattan, Hamming, or cosine distance can also be used.
- Store the calculated distances in an array and sort the array in ascending order of distance.
- Select the top k entries from the sorted array.
- Assign a class (or label) to the test data point based on the majority class among the selected k nearest points.
- Return the predicted class. End!
The steps above describe KNN classification, in which the prediction for a test point is the majority class (highest number of votes) among its neighbors.
In the case of KNN regression, most of the steps are the same; instead of assigning the majority class, the average of the k neighbors’ values is calculated and assigned to the test data point. A minimal from-scratch sketch of both variants follows below.
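To make the pseudocode concrete, here is a minimal from-scratch sketch of KNN classification and regression using NumPy. The function names (`knn_predict_class`, `knn_predict_reg`) and the toy data are illustrative only, not part of any library.

```python
import numpy as np
from collections import Counter

def knn_predict_class(X_train, y_train, x_test, k=3):
    # 1. Euclidean distance from the test point to every training sample
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2-3. Sort by distance and take the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_predict_reg(X_train, y_train, x_test, k=3):
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # For regression, average the k nearest target values instead of voting
    return y_train[nearest].mean()

# Toy example: two 2D clusters labeled 0 and 1
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict_class(X, y, np.array([2, 2]), k=3))   # expected: 0
print(knn_predict_reg(X, np.array([1.0, 1.2, 0.9, 8.1, 8.0, 7.9]),
                      np.array([8, 8]), k=3))           # mean of the 3 nearest targets
```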
3. How to choose an optimal k value?
The k value indicates the number of nearest neighbors that will be used for majority voting. Choosing the value of k is a critical step. There is no rule or formula that yields the optimal value of k, but some general guidelines help to estimate a good value (a small cross-validation sketch follows the list below):
- Choose an odd value of k to avoid ties (a draw) in the majority vote of neighbor classes.
- Low values of k such as 1 or 2 are noisy and not advisable to use. Generally, k = 3 or 5 tends to provide better results.
- A small value of k is computationally cheaper, while a large value of k can become computationally expensive.
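A more systematic approach is to evaluate several candidate values of k with cross-validation and pick the best-scoring one. The snippet below is a minimal sketch using scikit-learn; the Iris dataset is used as a stand-in for your own feature matrix and labels.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Evaluate odd k values and keep the one with the best cross-validated accuracy
k_values = range(1, 22, 2)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in k_values]
best_k = k_values[int(np.argmax(scores))]
print("Best k:", best_k, "with accuracy", round(max(scores), 3))
```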
4. Types of Distance Metrics
There are various methods to calculate the distance between two data points: Euclidean distance, Manhattan distance, Hamming distance, cosine distance, and Minkowski distance. Of these, Euclidean distance is the most commonly used distance function (or metric).
- For continuous data: Euclidean, Manhattan distance
- For categorical data: Hamming distance
Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).
Regardless of the number of variables or features, the Euclidean distance is the distance between two points in the Euclidean space:
- With 2 variables (x, y), the space is 2D and the distance between two data points is √((x₂−x₁)² + (y₂−y₁)²)
- With 3 variables (x, y, z), the space is 3D and the distance is √((x₂−x₁)² + (y₂−y₁)² + (z₂−z₁)²)
- With 5 variables, the space is 5D. Naming them (a, b, c, d, e), the two data points are (a₁, b₁, c₁, d₁, e₁) and (a₂, b₂, c₂, d₂, e₂), and the distance between them is √((a₂−a₁)² + (b₂−b₁)² + (c₂−c₁)² + (d₂−d₁)² + (e₂−e₁)²)
Manhattan Distance: This is the distance between two points measured as the sum of the absolute differences of their coordinates, e.g. |x₂−x₁| + |y₂−y₁| in 2D.
Hamming Distance: Hamming distance is used for categorical variables.
- If the value x and the value y are the same, the distance D equals 0.
- Otherwise, D = 1.
A short sketch computing all three metrics follows below.
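As a quick illustration, the following sketch computes Euclidean, Manhattan, and Hamming distances with NumPy and SciPy. The point values are made up for demonstration.

```python
import numpy as np
from scipy.spatial import distance

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared differences
print(np.sqrt(np.sum((p - q) ** 2)))   # 5.0
print(distance.euclidean(p, q))        # same result via SciPy

# Manhattan: sum of absolute differences
print(np.sum(np.abs(p - q)))           # 7.0
print(distance.cityblock(p, q))        # same result via SciPy

# Hamming (for categorical values): count of positions where the values differ
a = np.array(["red", "blue", "green"])
b = np.array(["red", "green", "green"])
print(int(np.sum(a != b)))             # 1 position differs, so D = 1 for that feature
```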
5. Advantages of KNN Algorithm
- Simple algorithm to understand, interpret and implement.
- Very useful for nonlinear data, since there is no assumption about the data (non-parametric).
- Versatile algorithm: can be used for classification as well as regression (a small regression sketch follows below).
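Since the code example later in this article covers classification, here is a short sketch of the regression side using scikit-learn’s KNeighborsRegressor. The synthetic data and the chosen parameters are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1D regression problem: y = sin(x) plus a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The prediction is the average target value of the k nearest training points
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X_train, y_train)
print("R^2 on test data:", round(reg.score(X_test, y_test), 3))
```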
6. Disadvantages of KNN Algorithm
- Computationally expensive, since the calculations are deferred to prediction time (lazy learner).
- High memory requirement compared to other supervised learning algorithms, as the entire training set must be stored.
- Selecting the optimal k can be tricky.
- Prediction is slow when the number of training samples N is large.
- Being a distance-based algorithm, KNN is very sensitive to irrelevant features, and scaling of the data (standardization or normalization) is necessary.
- The curse of dimensionality: when the data has too many features, distances between points become less informative and the KNN algorithm becomes highly susceptible to overfitting. During the data preparation phase, dimensionality reduction techniques such as principal component analysis (PCA) or feature selection should be used to mitigate this. A small preprocessing sketch follows below.
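To address the scaling and dimensionality issues together, a common pattern is to chain scaling, PCA, and KNN in a single scikit-learn pipeline. This is a minimal sketch, assuming a generic feature matrix and labels (here the breast cancer dataset stands in as an example); the number of PCA components is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 features, stand-in dataset

# Scale features, reduce to 10 principal components, then apply KNN
pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                     KNeighborsClassifier(n_neighbors=5))
print("CV accuracy:", round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```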
7. Python Code – KNN
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the Titanic dataset
path = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(path)
df.head()

# Use a few numeric features and 'Survived' as the target
X = df[['Pclass', 'SibSp', 'Parch']]
y = df['Survived']

# Split into training and test sets (70% / 30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Scale the features: KNN is distance based, so scaling is necessary
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit a KNN classifier with k = 3 and predict on the test set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluate the predictions
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", round(result2, 2))
```
```
Confusion Matrix:
[[128  32]
 [ 52  56]]
Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.80      0.75       160
           1       0.64      0.52      0.57       108

    accuracy                           0.69       268
   macro avg       0.67      0.66      0.66       268
weighted avg       0.68      0.69      0.68       268

Accuracy: 0.69
```
```python
# Try K values from 1 to 39 and record the error rate and accuracy for each
error_rate = []
acc = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    error_rate.append(np.mean(y_pred != y_test))
    acc.append(accuracy_score(y_test, y_pred))

# Plot error rate and accuracy against K
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plt.plot(range(1, 40), error_rate, color='purple', linestyle='dashed',
         marker='o', markerfacecolor='white', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
# K = list index + 1 because the loop starts at K = 1
print("Minimum error:", round(min(error_rate), 2),
      "at K =", error_rate.index(min(error_rate)) + 1)

plt.subplot(1, 2, 2)
plt.plot(range(1, 40), acc, color='purple', linestyle='dashed',
         marker='o', markerfacecolor='white', markersize=10)
plt.title('Accuracy vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy')
print("Maximum accuracy:", round(max(acc), 2),
      "at K =", acc.index(max(acc)) + 1)
plt.show()
```
Check out how to perform classification and regression model evaluation here
Learn more about Data Science Algorithms here