Decoding the Confusion Matrix

In a classification problem, a single accuracy value might not give you the whole picture. A confusion matrix (or error matrix) is therefore used to summarize the performance of a classification algorithm.

Calculating a confusion matrix shows you where the classification model is right and what types of errors it is making.

A confusion matrix is used to check the performance of a classification model on a set of test data for which the true values are known. Most performance measures, such as precision and recall, are calculated from the confusion matrix.

This article covers:
1. What a confusion matrix is and why it is needed.
2. How to calculate a confusion matrix for a 2-class classification problem using a cat-dog example.
3. How to create a confusion matrix in Python & R.
4. Summary and intuition on different measures: Accuracy, Recall, Precision & Specificity

1. Confusion Matrix:

A confusion matrix provides an easy summary of the predictive results in a classification problem. Correct and incorrect predictions are counted and broken down by class in a table.

Confusion Matrix for the Binary Classification

We cannot rely on a single accuracy value when the classes are imbalanced. For example, suppose we have a dataset of 100 patients in which 5 have diabetes and 95 are healthy. If our model only predicts the majority class, i.e. that all 100 people are healthy, it still achieves a classification accuracy of 95% while missing every diabetic patient. Therefore, we need a confusion matrix.
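The imbalance problem can be reproduced in a few lines. This is a minimal sketch on made-up data: the 0/1 encoding (1 = diabetic, 0 = healthy) and the always-healthy "model" are assumptions for illustration only.

```python
# Hypothetical data: 5 diabetic patients (1) and 95 healthy ones (0).
actual = [1] * 5 + [0] * 95
# A trivial "model" that always predicts the majority class (healthy).
predicted = [0] * 100

# Accuracy: fraction of predictions that match the actual label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall for the diabetic class: fraction of actual positives that were found.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"Recall:   {recall:.2f}")    # 0.00
```

The accuracy looks good, yet the model does not find a single diabetic patient, which is exactly what the confusion matrix makes visible.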

2. Calculate a confusion matrix:

Let’s take an example:

We have a total of 10 animals (cats and dogs), and our model predicts whether each one is a cat or not. Cat is therefore the positive class and dog the negative class.

Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’]
Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’]

Confusion Matrix Example
Remember, Positive/Negative describes the predicted value, and True/False describes whether that prediction matches the actual value.

Definition of the Terms:

True Positive: You predicted positive and it’s true. You predicted that an animal is a cat and it actually is.

True Negative: You predicted negative and it’s true. You predicted that an animal is not a cat and it actually is not (it’s a dog).

False Positive (Type 1 Error): You predicted positive and it’s false. You predicted that an animal is a cat but it actually is not (it’s a dog).

False Negative (Type 2 Error): You predicted negative and it’s false. You predicted that an animal is not a cat but it actually is.
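These four definitions can be applied to the cat-dog lists pair by pair. The following is a minimal sketch in plain Python; the helper name `confusion_counts` is hypothetical and cat is assumed to be the positive class.

```python
actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat']


def confusion_counts(actual, predicted, positive="cat"):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted cat, actually a cat
            else:
                fp += 1  # predicted cat, actually a dog (Type 1 error)
        else:
            if a == positive:
                fn += 1  # predicted not-cat, actually a cat (Type 2 error)
            else:
                tn += 1  # predicted not-cat, actually a dog
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=4 FP=2 FN=1
```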

Classification Accuracy:

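Classification Accuracy is the ratio of correct predictions to the total number of predictions. It is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)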


Recall (aka Sensitivity):

Recall is defined as the number of correctly classified positive examples divided by the total number of actual positive examples. In other words, out of all the actual positives, how many did we predict correctly? Recall should be high.

Recall = TP / (TP + FN)



Precision:

Precision is defined as the number of correctly classified positive examples divided by the total number of predicted positive examples. In other words, out of all the examples we predicted as positive, how many are actually positive? Precision should be high.

Precision = TP / (TP + FP)


Trick to remember: Precision has the Predicted positives in the denominator.

F-score or F1-score:

It is difficult to compare two models with different Precision and Recall. So to make them comparable, we use F-Score. It is the Harmonic Mean of Precision and Recall. As compared to Arithmetic Mean, Harmonic Mean punishes the extreme values more. F-score should be high.

F-score = (2 * Recall * Precision) / (Recall + Precision)
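To see how the harmonic mean punishes extreme values, compare it with the arithmetic mean for a lopsided model. This is a small sketch; the helper `f_score` and the precision/recall pair (1.0, 0.1) are made up for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A hypothetical model with perfect precision but very poor recall.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f_score(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.55
print(f"F-score:         {harmonic_mean:.2f}")    # 0.18
```

The arithmetic mean suggests a mediocre model, while the F-score stays close to the weaker of the two values.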


Specificity:

Specificity determines the proportion of actual negatives that are correctly identified.

Specificity = TN / (TN + FP)


Example to interpret confusion matrix:

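Let’s calculate the confusion matrix for the cat and dog example above. Treating cat as the positive class and comparing the actual and predicted values pair by pair gives TP = 3, TN = 4, FP = 2 and FN = 1: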

Classification Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (3+4)/(3+4+2+1) = 0.70

Recall: Recall gives us an idea about when it’s actually yes, how often does it predict yes.
Recall = TP / (TP + FN) = 3/(3+1) = 0.75

Precision: Precision tells us about when it predicts yes, how often is it correct.
Precision = TP / (TP + FP) = 3/(3+2) = 0.60

F-score = (2*Recall*Precision)/(Recall+Precision) = (2*0.75*0.60)/(0.75+0.60) = 0.67

Specificity = TN / (TN + FP) = 4/(4+2) = 0.67
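The arithmetic above can be checked with a short script. This is a sketch that simply plugs the counts from the cat-dog example (TP = 3, TN = 4, FP = 2, FN = 1) into the formulas; the variable names are chosen for illustration.

```python
# Counts from the cat-dog example, with cat as the positive class.
tp, tn, fp, fn = 3, 4, 2, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_score = 2 * recall * precision / (recall + precision)
specificity = tn / (tn + fp)

for name, value in [
    ("Accuracy", accuracy),
    ("Recall", recall),
    ("Precision", precision),
    ("F-score", f_score),
    ("Specificity", specificity),
]:
    print(f"{name:<12}{value:.2f}")
```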

3. Create a confusion matrix in Python & R

Let’s work through the cat and dog example above in both Python and R; this should reinforce what you have learned about the confusion matrix so far.

PYTHON: First, let’s use Python to create a confusion matrix. We import the confusion_matrix function from the sklearn.metrics module, which generates the confusion matrix for us.

Below is the Python implementation of the above explanation:

# Python script for confusion matrix creation.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Actual labels and the model's predictions for the 10 animals.
actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat']

# Rows are actual classes, columns are predicted classes (sorted: cat, dog).
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print('Accuracy Score :', accuracy_score(actual, predicted))
print('Classification Report : ')
print(classification_report(actual, predicted))

Confusion Matrix :
[[3 1]
 [2 4]]

Accuracy Score: 0.7

Classification Report: 
              precision    recall  f1-score   support

         cat       0.60      0.75      0.67         4
         dog       0.80      0.67      0.73         6

   micro avg       0.70      0.70      0.70        10
   macro avg       0.70      0.71      0.70        10
weighted avg       0.72      0.70      0.70        10
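If you need the four individual counts rather than the printed matrix, one option is to fix the label order and flatten the result. This is a sketch, not the only way to do it: passing labels=['dog', 'cat'] puts the negative class first, so ravel() returns the counts in the order TN, FP, FN, TP. Treating cat as the positive class is an assumption carried over from the example.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat']

# Negative class first, positive class second: ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=['dog', 'cat']).ravel()
print(tp, tn, fp, fn)  # 3 4 2 1

# Precision and recall for the positive class 'cat'.
precision = precision_score(actual, predicted, pos_label='cat')
recall = recall_score(actual, predicted, pos_label='cat')
# Specificity is computed directly from the counts.
specificity = tn / (tn + fp)

print(f"Precision:   {precision:.2f}")
print(f"Recall:      {recall:.2f}")
print(f"Specificity: {specificity:.2f}")
```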

R: Let’s use R code to create a confusion matrix now. We will use the caret library in R to calculate the confusion matrix.

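library(caret)  # provides confusionMatrix()
# confusionMatrix() expects factors that share the same levels.
actual <- factor(c('dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog'))
predicted <- factor(c('dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat'))
results <- confusionMatrix(data = predicted, reference = actual)
print(results)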

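In the output below, the classes appear as 0 (dog) and 1 (cat), and the 'Positive' class is 0 (dog). Sensitivity and Specificity are therefore reported from the dog's point of view, which is why they are swapped relative to the hand calculation above.

Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 4 1
         1 2 3

Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.3823

Kappa : 0.4
Mcnemar's Test P-Value : 1.0000

Sensitivity : 0.6667
Specificity : 0.7500
Pos Pred Value : 0.8000
Neg Pred Value : 0.6000
Prevalence : 0.6000
Detection Rate : 0.4000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.7083

'Positive' Class : 0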

4. Summary:

  • Precision is how certain you are of your true positives. Recall is how certain you are that you are not missing any positives.
  • Choose Recall if false negatives are unacceptable. For example, when screening for diabetes, you would rather accept some extra false positives (false alarms) than miss actual cases (false negatives).
  • Choose Precision if you want to be more confident of your true positives. For example, with spam filtering, you would rather have some spam emails in your inbox than some regular emails in your spam box. You want to be extra sure that email X is spam before putting it in the spam box.
  • Choose Specificity if you want to cover all true negatives, i.e. you do not want any false alarms (false positives). For example, in a drug test where everyone who tests positive immediately goes to jail, you would not want anyone drug-free going to jail.

We can conclude that:

  • An Accuracy of 70% means that 3 of every 10 animals are classified incorrectly and 7 are classified correctly.
  • A Precision of 60% means that 4 of every 10 animals labeled as cats are actually not cats (i.e. they are dogs), and 6 are cats.
  • A Recall of 75% means that 1 of every 4 actual cats is missed by our model and 3 are correctly identified as cats.
  • A Specificity of 67% means that about 1 of every 3 actual dogs (i.e. not cats) is mislabeled as a cat and 2 are correctly labeled as dogs.

If you have any comments or questions feel free to leave your feedback below. You can always reach me on LinkedIn.


2 thoughts on “Decoding the Confusion Matrix”

  1. Prateek, the confusion matrix is very nicely explained. I have a query and need your expert help. I did binary prediction with an XGBoost model, and when I obtain the confusion matrix, I get McNemar's test with p<0.05. How should I interpret it? Does it show a significant difference between the first model when the algorithm started and the last model when the algorithm stopped? If not, what is it actually telling me? If yes, please share an academic reference. I obtained the confusion matrix from the caret package in R.

    • Hi Arun, I believe you obtained the confusion matrix after predicting on the validation set. To my knowledge, the answer to your question is no. In this scenario, McNemar's test checks the null hypothesis that the proportions of Type 1 and Type 2 errors are the same. The alternative hypothesis is that the two proportions are not equal. The test treats the confusion matrix as a contingency table.

      I think you may be considering the case in which McNemar’s test is used to compare 2 machine learning classifier models on a test dataset. In that case, we make a contingency matrix and check the top right and bottom left cells (disagreement cases).

      • If these cells have similar counts, it shows that both models make errors almost in the same proportion, just on different instances of the test set. So in this case, the result of the test would not be significant and the null hypothesis would not be rejected.
      • If these cells have different counts, the result of the test would be significant and we would reject the null hypothesis.

      So in your case, the Type 1 and Type 2 error counts must be very different (possibly because of imbalanced data). You can run McNemar's test on your confusion matrix yourself and you should get the same result as in the confusion matrix summary. Just be careful with this test if the disagreement cells have a count of less than 25. In that case, you should use a modified version of the test that calculates an exact p-value using a binomial distribution. Hope this helps!



© 2022 KeyToDataScience