In a classification problem, relying on classification accuracy alone might not give you the whole picture. So a confusion matrix, also called an error matrix, is used to summarize the performance of a classification algorithm.
Calculating a confusion matrix can give you an idea of where the classification model is right and what types of errors it is making.
A confusion matrix is used to check the performance of a classification model on a set of test data for which the true values are known. Most performance measures, such as precision and recall, are calculated from the confusion matrix.
This article covers:
1. What a confusion matrix is and why it is needed.
2. How to calculate a confusion matrix for a 2-class classification problem, using a cat-dog example.
3. How to create a confusion matrix in Python & R.
4. Summary and intuition on different measures: Accuracy, Recall, Precision & Specificity.
1. Confusion Matrix:
A confusion matrix provides an easy summary of the predictive results in a classification problem. Correct and incorrect predictions are summarized in a table with their counts, broken down by each class.
We cannot rely on a single accuracy value in classification when the classes are imbalanced. For example, suppose we have a dataset of 100 patients in which 5 have diabetes and 95 are healthy. A model that simply predicts the majority class, labeling all 100 people as healthy, still achieves a classification accuracy of 95% while being useless for detecting diabetes. This is why we need a confusion matrix.
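The imbalanced-class pitfall can be sketched in a few lines of Python, using the hypothetical 100-patient dataset above:

```python
# A "model" that always predicts the majority class still scores 95%
# accuracy on this imbalanced dataset, despite missing every diabetic patient.
actual = ['diabetic'] * 5 + ['healthy'] * 95
predicted = ['healthy'] * 100  # majority-class predictor

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.95
```

High accuracy here tells us nothing about the 5 diabetic patients, all of whom were misclassified.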
2. Calculate a confusion matrix:
Let’s take an example:
We have a total of 10 animals (4 cats and 6 dogs), and our model predicts whether each one is a cat or not.
Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’]
Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’]
Definition of the Terms:
True Positive: You predicted positive and it’s true. You predicted that an animal is a cat and it actually is.
True Negative: You predicted negative and it’s true. You predicted that an animal is not a cat and it actually is not (it’s a dog).
False Positive (Type 1 Error): You predicted positive and it’s false. You predicted that an animal is a cat but it actually is not (it’s a dog).
False Negative (Type 2 Error): You predicted negative and it’s false. You predicted that an animal is not a cat but it actually is.
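Treating 'cat' as the positive class, the four counts can be tallied directly from the example lists:

```python
actual    = ['dog','cat','dog','cat','dog','dog','cat','dog','cat','dog']
predicted = ['dog','dog','dog','cat','dog','dog','cat','cat','cat','cat']

# Count each outcome by comparing actual and predicted labels pairwise.
tp = sum(a == 'cat' and p == 'cat' for a, p in zip(actual, predicted))
tn = sum(a == 'dog' and p == 'dog' for a, p in zip(actual, predicted))
fp = sum(a == 'dog' and p == 'cat' for a, p in zip(actual, predicted))
fn = sum(a == 'cat' and p == 'dog' for a, p in zip(actual, predicted))
print(tp, tn, fp, fn)  # 3 4 2 1
```

These four numbers (TP=3, TN=4, FP=2, FN=1) are all we need for every metric that follows.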
Classification Accuracy is given by the relation: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall (aka Sensitivity):
Recall is defined as the ratio of the number of correctly classified positive samples divided by the total number of actual positive samples. In other words: out of all the actual positives, how many did we predict correctly? Recall should be high.
Precision is defined as the ratio of the number of correctly classified positive samples divided by the total number of samples predicted as positive. In other words: out of all the predicted positives, how many are actually positive? Precision should be high.
Trick to remember: Precision has Predictive Results in the denominator.
F-score or F1-score:
It is difficult to compare two models when one has higher Precision and the other higher Recall. So to make them comparable, we use the F-score: the Harmonic Mean of Precision and Recall. Compared to the Arithmetic Mean, the Harmonic Mean punishes extreme values more. F-score should be high.
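A quick numerical sketch shows why the harmonic mean is the right choice here: it is dragged toward the smaller of the two values, so a model cannot hide a very low precision behind a high recall.

```python
# Extreme case: perfect precision but terrible recall.
precision, recall = 1.0, 0.1

arithmetic = (precision + recall) / 2               # 0.55 looks acceptable
f1 = 2 * precision * recall / (precision + recall)  # ~0.18 exposes the weakness
print(round(arithmetic, 2), round(f1, 2))  # 0.55 0.18
```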
Specificity determines the proportion of actual negatives that are correctly identified.
Example to interpret confusion matrix:
Let’s calculate the confusion matrix using the cat and dog example above:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (3+4)/(3+4+2+1) = 0.70
Recall: Recall gives us an idea of how often the model predicts yes when the answer is actually yes.
Recall = TP / (TP + FN) = 3/(3+1) = 0.75
Precision: Precision tells us how often the model is correct when it predicts yes.
Precision = TP / (TP + FP) = 3/(3+2) = 0.60
F-score = (2*Recall*Precision)/(Recall+Precision) = (2*0.75*0.60)/(0.75+0.60) = 0.67
Specificity = TN / (TN + FP) = 4/(4+2) = 0.67
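The calculations above can be reproduced in a few lines of Python, starting from the four counts of the cat/dog example:

```python
tp, tn, fp, fn = 3, 4, 2, 1  # counts from the cat/dog example

accuracy    = (tp + tn) / (tp + tn + fp + fn)
recall      = tp / (tp + fn)
precision   = tp / (tp + fp)
f_score     = 2 * recall * precision / (recall + precision)
specificity = tn / (tn + fp)

print(accuracy, recall, precision, round(f_score, 2), round(specificity, 2))
# 0.7 0.75 0.6 0.67 0.67
```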
3. Create a confusion matrix in Python & R
Let’s use both Python and R to work through the dog and cat example above; this will give you a better understanding of what you have learned about the confusion matrix so far.
PYTHON: First, let’s look at the Python code to create a confusion matrix. We import the confusion_matrix function from the sklearn library, which generates the confusion matrix for us.
Below is the Python implementation of the above explanation:
# Python script for confusion matrix creation.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat']
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print ('Accuracy Score :',accuracy_score(actual, predicted))
print('Classification Report : ')
print (classification_report(actual, predicted))
OUTPUT ->
Confusion Matrix :
[[3 1]
 [2 4]]
Accuracy Score : 0.7
Classification Report :
              precision    recall  f1-score   support

         cat       0.60      0.75      0.67         4
         dog       0.80      0.67      0.73         6

   micro avg       0.70      0.70      0.70        10
   macro avg       0.70      0.71      0.70        10
weighted avg       0.72      0.70      0.70        10
R: Let’s use R code to create a confusion matrix now. We will use the caret library in R to calculate the confusion matrix.
library(caret)

actual <- c('dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog')
predicted <- c('dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat')
# confusionMatrix() expects factors, not character vectors
results <- confusionMatrix(data=factor(predicted), reference=factor(actual))
print(results)
OUTPUT ->
Confusion Matrix and Statistics

          Reference
Prediction cat dog
       cat   3   2
       dog   1   4

               Accuracy : 0.7
                 95% CI : (0.3475, 0.9333)
    No Information Rate : 0.6
    P-Value [Acc > NIR] : 0.3823
                  Kappa : 0.4
 Mcnemar's Test P-Value : 1.0000

            Sensitivity : 0.7500
            Specificity : 0.6667
         Pos Pred Value : 0.6000
         Neg Pred Value : 0.8000
             Prevalence : 0.4000
         Detection Rate : 0.3000
   Detection Prevalence : 0.5000
      Balanced Accuracy : 0.7083

       'Positive' Class : cat
- Precision is how certain you are of your true positives. Recall is how certain you are that you are not missing any positives.
- Choose Recall when the cost of false negatives is unacceptable. For example, when screening for diabetes you would rather raise a few false alarms (false positives) than miss a patient who actually has the disease (a false negative).
- Choose Precision if you want to be more confident of your true positives. For example, in case of spam emails, you would rather have some spam emails in your inbox rather than some regular emails in your spam box. You would like to be extra sure that email X is spam before we put it in the spam box.
- Choose Specificity if you want to correctly cover all true negatives, i.e. you do not want any false alarms or false positives. For example, in the case of a drug test in which all people who test positive will immediately go to jail, you would not want anyone drug-free going to jail.
We can conclude that:
- An Accuracy of 70% means that 3 of every 10 predictions are incorrect, and 7 are correct.
- A Precision of 60% means that of every 10 animals labeled as cats, 4 are actually dogs and 6 are cats.
- A Recall of 75% means that 1 of every 4 actual cats is missed by our model, and 3 are correctly identified as cats.
- A Specificity of 67% means that roughly 1 of every 3 dogs (i.e. actual negatives) is mislabeled as a cat, and 2 are correctly labeled as dogs.
If you have any comments or questions feel free to leave your feedback below. You can always reach me on LinkedIn.