Evaluating Deep Learning Models: The Confusion Matrix, Accuracy, Precision, and Recall

Learn about the Confusion Matrix, Accuracy, Precision, and Recall when it comes to evaluating deep learning models

In computer vision, object detection is the task of locating one or more objects in an image. Besides traditional object detection techniques, advanced deep learning models like R-CNN and YOLO can achieve impressive detection across a variety of objects. These models accept an image as input and return the bounding box coordinates of each detected object.

This tutorial discusses the confusion matrix and how accuracy, precision, and recall are calculated from it. The mean average precision (mAP) will be covered in a later tutorial.

We’ll go over the following topics in detail:

  • Binary Classification Confusion Matrix
  • Multi-Class Classification Confusion Matrix
  • Calculating the Confusion Matrix with Scikit-learn
  • Accuracy, Precision, and Recall
  • Precision or Recall?
  • Final Thoughts
Binary Classification Confusion Matrix

In binary classification, each input sample is assigned to one of two classes. Generally, these two classes are given labels such as 1 and 0, or positive and negative. More specifically, the two class labels might be malignant or benign (e.g. if the problem is cancer classification), or success or failure (e.g. if the problem is classifying student test scores).

Assume there is a binary classification problem with the classes positive and negative. Shown below are the labels of seven samples used to train the model. These are called the ground-truth labels of the samples.

positive, negative, negative, positive, positive, positive, negative

Note that the class labels help us humans differentiate between the classes; what matters to the model is a numeric score. When a single sample is fed to the model, the model usually returns a score rather than a class label. For instance, when these seven samples are fed into the model, their class scores could be:

0.6, 0.2, 0.55, 0.9, 0.4, 0.8, 0.5

Each sample is assigned a class label based on its score. How do we convert these scores into labels? We do it using a threshold. The threshold is a hyperparameter of the model that can be set by the user. For example, with a threshold of 0.5, any sample whose score is greater than or equal to 0.5 is classified as positive; otherwise, it is classified as negative. Here are the predicted labels for the samples:

positive (0.6), negative (0.2), positive (0.55), positive (0.9), negative (0.4), positive (0.8), positive (0.5)
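
To make the thresholding step concrete, here is a minimal sketch (not part of the original tutorial) that converts the scores above into labels using the 0.5 threshold:

scores = [0.6, 0.2, 0.55, 0.9, 0.4, 0.8, 0.5]
threshold = 0.5

# A score >= threshold is classified as positive; otherwise negative.
y_pred = ["positive" if s >= threshold else "negative" for s in scores]
print(y_pred)

['positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'positive']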

Here are the predicted and ground-truth labels side by side for comparison. At a glance, we can count four correct predictions and three incorrect ones. Note that changing the threshold changes the results; raising it to 0.6, for example, leaves just one incorrect prediction.

Ground-Truth: positive, negative, negative, positive, positive, positive, negative
Predicted: positive, negative, positive, positive, negative, positive, positive

To extract more information about the model's performance, the confusion matrix is used. The confusion matrix helps us visualize whether the model is confusing the two classes. It is a 2×2 matrix, as shown in the next figure. The two rows and two columns are labelled Positive and Negative, corresponding to the two class labels. In this example, the row labels represent the ground-truth labels, while the column labels represent the predicted labels. This convention could be reversed.

[Figure: layout of the 2×2 confusion matrix for binary classification]

The four elements of the matrix (the red and green items) represent the four metrics that count the number of correct and incorrect predictions the model made. Each element has a label made up of two words:

  1. True or False
  2. Positive or Negative

The first word is True when the prediction is correct (i.e. the predicted and ground-truth labels match), and False when they do not match. The second word, Positive or Negative, refers to the predicted label.

In summary, the first word is False whenever the prediction is incorrect, and True otherwise. The goal is to maximize the True Positive and True Negative metrics while minimizing the other two metrics (False Positive and False Negative). The four metrics of the confusion matrix are:

  • Top-Left (True Positive): How many times did the model correctly classify a Positive sample as Positive?
  • Top-Right (False Negative): How many times did the model incorrectly classify a Positive sample as Negative?
  • Bottom-Left (False Positive): How many times did the model incorrectly classify a Negative sample as Positive?
  • Bottom-Right (True Negative): How many times did the model correctly classify a Negative sample as Negative?

These four metrics can be calculated for the seven predictions we saw earlier. The resulting confusion matrix is shown in the next figure.

[Figure: confusion matrix for the seven predictions]
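
As a rough illustration (not part of the original tutorial), the four counts can be computed directly from the two label lists:

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative", "positive", "positive"]

# Count each (ground-truth, predicted) combination.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == "positive" and p == "positive")
fn = sum(1 for t, p in zip(y_true, y_pred) if t == "positive" and p == "negative")
fp = sum(1 for t, p in zip(y_true, y_pred) if t == "negative" and p == "positive")
tn = sum(1 for t, p in zip(y_true, y_pred) if t == "negative" and p == "negative")
print(tp, fn, fp, tn)

3 1 2 1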

For a binary classification problem, this is how the confusion matrix is determined. Let’s look at how it’d be done for a multi-class problem.

Multi-Class Classification Confusion Matrix

What if our problem has more than two classes? How do we calculate these four metrics in the confusion matrix for a multi-class classification problem? It's simple.

Assume there are nine samples, each belonging to one of three classes: White, Black, or Red. The ground-truth labels of the nine samples are shown below.

Red, Black, Red, White, White, Red, Black, Red, White

Here are the predicted labels after the samples are fed into a model.

Red, White, Black, White, Red, Red, Black, White, Red

Here they are side by side for better comparison.

Ground-Truth: Red, Black, Red, White, White, Red, Black, Red, White
Predicted: Red, White, Black, White, Red, Red, Black, White, Red

To calculate the confusion matrix, a target class must first be chosen. Let's set the target to the Red class. This class is marked as Positive, and all other classes are marked as Negative. Here are the ground-truth and predicted labels after the replacement:

Positive, Negative, Positive, Negative, Negative, Positive, Negative, Positive, Negative
Positive, Negative, Negative, Negative, Positive, Positive, Negative, Negative, Positive

After the replacement there are only two classes (Positive and Negative), so the confusion matrix can be calculated exactly as in the previous section. Note that this matrix applies only to the Red class.

[Figure: confusion matrix for the Red class]
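
Here is a minimal sketch of the mapping step (my own illustration; to_binary is a hypothetical helper, not something the tutorial defines), which converts the multi-class labels into Positive/Negative labels for a chosen target class:

y_true = ["Red", "Black", "Red", "White", "White", "Red", "Black", "Red", "White"]
y_pred = ["Red", "White", "Black", "White", "Red", "Red", "Black", "White", "Red"]

def to_binary(labels, target):
    # The target class becomes Positive; every other class becomes Negative.
    return ["Positive" if label == target else "Negative" for label in labels]

print(to_binary(y_true, "Red"))
print(to_binary(y_pred, "Red"))

['Positive', 'Negative', 'Positive', 'Negative', 'Negative', 'Positive', 'Negative', 'Positive', 'Negative']
['Positive', 'Negative', 'Negative', 'Negative', 'Positive', 'Positive', 'Negative', 'Negative', 'Positive']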

Similarly for the White class: replace every White label with Positive and every other class label with Negative. Here are the ground-truth and predicted labels after the replacement. The confusion matrix for the White class is shown in the next figure.

Negative, Negative, Negative, Positive, Positive, Negative, Negative, Negative, Positive
Negative, Positive, Negative, Positive, Negative, Negative, Negative, Positive, Negative
[Figure: confusion matrix for the White class]

Calculating the Confusion Matrix with Scikit-Learn

The metrics module of Python's popular Scikit-learn library can be used to calculate the metrics in the confusion matrix.

For binary-class problems, the confusion_matrix() function is used. Of its accepted parameters, we use the following two:

  • y_true: The ground-truth labels.
  • y_pred: The predicted labels.

The code below calculates the confusion matrix for the binary classification example discussed earlier.

import sklearn.metrics

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative", "positive", "positive"]

r = sklearn.metrics.confusion_matrix(y_true, y_pred)
print(r)

[[1 2]
 [1 3]]

It’s worth noting that the metrics aren’t in the same order as they were previously. The True Positive metric, for example, is located in the bottom-right corner, while True Negative is located in the top-left corner. We can solve this by flipping the matrix.

import numpy

r = numpy.flip(r)
print(r)

[[3 1]
 [2 1]]
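
As a side note (not something the tutorial relies on), confusion_matrix() also accepts a labels parameter that fixes the row/column order directly, which avoids the flip. Continuing from the previous snippet:

r = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=["positive", "negative"])
print(r)

[[3 1]
 [2 1]]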

For multi-class classification problems, the multilabel_confusion_matrix() function is used, as shown below. In addition to the y_true and y_pred parameters, a third parameter named labels accepts a list of the class labels.

import sklearn.metrics
import numpy

y_true = ["Red", "Black", "Red", "White", "White", "Red", "Black", "Red", "White"]
y_pred = ["Red", "White", "Black", "White", "Red", "Red", "Black", "White", "Red"]

r = sklearn.metrics.multilabel_confusion_matrix(y_true, y_pred, labels=["White", "Black", "Red"])
print(r)

[[[4 2]
  [2 1]]

 [[6 1]
  [1 1]]

 [[3 2]
  [2 2]]]

The function calculates the confusion matrix for each class and returns all the matrices. Their order matches the order of the class labels in the labels parameter. As before, numpy.flip() is used to reorder the metrics within each matrix.

print(numpy.flip(r[0])) # White class confusion matrix
print(numpy.flip(r[1])) # Black class confusion matrix
print(numpy.flip(r[2])) # Red class confusion matrix

White class confusion matrix

[[1 2]
 [2 4]]

Black class confusion matrix

[[1 1]
 [1 6]]

Red class confusion matrix

[[2 2]
 [2 3]]

We’ll concentrate on only two classes for the remainder of this tutorial. The confusion matrix is used to measure three main metrics, which are discussed in the following section.

Accuracy, Precision, and Recall

As we've seen, the confusion matrix offers four individual metrics. Based on these four, other metrics can be calculated that give more information about how the model behaves:

  • Accuracy
  • Precision
  • Recall

The next subsections discuss each of these three metrics.

Accuracy

Accuracy is a metric that summarizes how the model performs across all classes. It is useful when all classes are of equal importance. It is calculated as the ratio between the number of correct predictions and the total number of predictions.

[Figure: Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)]

Based on the confusion matrix computed earlier, here is how to calculate the accuracy with Scikit-learn. The variable acc holds the result of dividing the sum of True Positives and True Negatives by the sum of all values in the matrix. The result is 0.5714, which means the model makes a correct prediction 57.14% of the time.

import numpy
import sklearn.metrics

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative", "positive", "positive"]

r = sklearn.metrics.confusion_matrix(y_true, y_pred)

r = numpy.flip(r)

acc = (r[0][0] + r[-1][-1]) / numpy.sum(r)
print(acc)

0.5714285714285714

The accuracy_score() function in the sklearn.metrics module can also be used to calculate the accuracy. It accepts the ground-truth and predicted labels as arguments.

acc = sklearn.metrics.accuracy_score(y_true, y_pred)
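
For this example, printing acc should again give roughly 0.5714 (4 correct predictions out of 7), matching the manual calculation above:

print(acc)

0.5714285714285714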

Note that the accuracy may be deceiving, for example when the data is imbalanced. Assume there are 600 samples in total, with 550 belonging to the Positive class and just 50 to the Negative class. Since most of the samples belong to one class, the accuracy for that class will be higher than for the other.

If the model makes 530/550 correct predictions for the Positive class but only 5/50 for the Negative class, the overall accuracy is (530 + 5) / 600 = 0.8917, i.e. 89.17%. You might then assume that the model will make a correct prediction 89.17% of the time for any sample, regardless of its class. That is not true, especially for the Negative class, where the model performs very poorly.
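
A minimal sketch (my own construction, using the hypothetical counts above: 530/550 correct Positive predictions and 5/50 correct Negative predictions) shows how the overall accuracy hides the poor performance on the Negative class:

import sklearn.metrics

# 550 Positive samples: 530 predicted correctly, 20 predicted as negative.
# 50 Negative samples: only 5 predicted correctly, 45 predicted as positive.
y_true = ["positive"] * 550 + ["negative"] * 50
y_pred = ["positive"] * 530 + ["negative"] * 20 + ["negative"] * 5 + ["positive"] * 45

acc = sklearn.metrics.accuracy_score(y_true, y_pred)
neg_correct = sum(1 for t, p in zip(y_true, y_pred) if t == "negative" and p == "negative")

print(acc)               # about 0.8917 overall
print(neg_correct / 50)  # 0.1: only 10% of the Negative samples are classified correctly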

Precision

The precision is calculated as the ratio between the number of Positive samples correctly classified and the total number of samples classified as Positive (either correctly or incorrectly). The precision measures the model's accuracy in classifying a sample as Positive.

[Figure: Precision = True Positive / (True Positive + False Positive)]

When the model makes many incorrect Positive classifications, or few correct Positive classifications, the precision becomes small. On the other hand, the precision is high when:

  • The model makes many correct Positive classifications (maximizing True Positives).
  • The model makes few incorrect Positive classifications (minimizing False Positives).

Imagine a man who is trusted by others; when he predicts something, people believe him. Precision is like this man. When the precision is high, you can trust the model when it predicts a sample as Positive. Thus, precision helps to know how accurate the model is when it says that a sample is Positive.

Following is a description of precision based on the previous discussion:

The precision indicates how accurate the model is at classifying Positive samples.

In the next figure, the green marks denote Positive samples and the red marks denote Negative samples. The model correctly classified two Positive samples but incorrectly classified one Negative sample as Positive. Thus, the True Positive count is 2 and the False Positive count is 1, and the precision is 2/(2+1)=0.667. In other words, when the model says a sample is Positive, it is trustworthy 66.7% of the time.

[Figure: example with 2 True Positives and 1 False Positive]

The aim is to classify all the Positive samples as Positive without misclassifying any Negative sample as Positive.

As shown in the next figure, if all three Positive samples are correctly classified but one Negative sample is incorrectly classified as Positive, the precision is 3/(3+1)=0.75. In that case, when the model says a sample is Positive, it is correct 75% of the time.

The only way to reach 100% precision is to avoid classifying any Negative sample as Positive (i.e. to have no False Positives).

Scikit-learn's sklearn.metrics module provides the precision_score() function, which accepts the ground-truth and predicted labels and returns the precision. The pos_label parameter accepts the label of the Positive class; it defaults to 1.

import sklearn.metrics

y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative", "negative"]

precision = sklearn.metrics.precision_score(y_true, y_pred, pos_label="positive")
print(precision)

0.6666666666666666

Recall

The recall is calculated as the ratio between the number of Positive samples correctly classified as Positive and the total number of Positive samples, i.e. Recall = True Positive / (True Positive + False Negative). The recall measures the model's ability to detect Positive samples. The higher the recall, the more positive samples are detected.

The recall cares only about how the positive samples are classified; it is independent of how the negative samples are classified (in contrast to the precision, which is not). If the model classifies all the positive samples as Positive, the recall is 100% even if every negative sample was incorrectly classified as Positive. Let's look at some examples.
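
As a quick demonstration (my own example, not from the tutorial; it uses recall_score() and precision_score(), which are covered in this tutorial), a model that predicts Positive for every sample gets a perfect recall but a poor precision:

import sklearn.metrics

y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive"] * 6   # the model labels every sample as Positive

recall = sklearn.metrics.recall_score(y_true, y_pred, pos_label="positive")
precision = sklearn.metrics.precision_score(y_true, y_pred, pos_label="positive")
print(recall)     # 1.0: every Positive sample was detected
print(precision)  # 0.5: half of the "Positive" predictions are actually Negative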

The next figure shows four distinct cases (A to D), all of which have the same recall of 0.667. The only distinction in each case is how the negative samples are classified. Case A, for example, correctly classifies all negative samples as negative, while case D incorrectly classifies all negative samples as positive. The recall only cares about the positive samples, regardless of how the negative samples are classified.

In all four cases, only two positive samples are correctly classified as positive, so the True Positive count is 2. One positive sample is classified as negative, so the False Negative count is 1. As a result, the recall is 2/(2+1)=2/3=0.667.

Because it makes no difference whether the negative samples are classified as positive or negative, it is simpler to ignore them altogether, as shown in the next figure; only the positive samples need to be considered when calculating the recall.

What does a high or low recall mean? When the recall is high, the model classifies most of the positive samples correctly as Positive, so it can be trusted in its ability to detect positive samples.

In the next figure, the recall is 1.0 because all the positive samples were correctly classified as Positive. The True Positive count is 3 and the False Negative count is 0, so the recall is 3/(3+0)=1. This means the model detected all the positive samples. Because the recall ignores how the negative samples are classified, there could still be many negative samples classified as positive (i.e. many False Positives); the recall does not take this into account.

On the other hand, the recall is 0.0 when the model fails to detect any positive sample. In the next figure, all the positive samples are incorrectly classified as Negative, which means the model detected 0% of the positive samples. The True Positive count is 0 and the False Negative count is 3, so the recall is 0/(0+3)=0.

When the recall is between 0.0 and 1.0, it reflects the percentage of positive samples that the model correctly classified as Positive. For example, if there are 10 positive samples and the recall is 0.6, the model correctly classified 60% of them (i.e. 0.6*10=6).

The recall_score() function in the sklearn.metrics module calculates the recall and is used in the same way as precision_score(). The next block of code shows an example.

import sklearn.metrics

y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative", "negative"]

recall = sklearn.metrics.recall_score(y_true, y_pred, pos_label="positive")
print(recall)

0.6666666666666666

Let’s recap what we’ve learned about precision and recall so far:

The precision measures how accurate the model is when it classifies a sample as Positive, while the recall measures how many of the positive samples the model correctly identified.

The precision takes into account how both the positive and the negative samples were classified, while the recall considers only the positive samples. In other words, the precision depends on both the negative and the positive samples, whereas the recall depends solely on the positive samples (and is independent of the negative samples).
The precision cares about how a sample is classified when it is labelled as Positive, but it does not care whether all positive samples were classified correctly. The recall cares about classifying all positive samples correctly, but it does not care whether a negative sample is classified incorrectly.

When a model has high recall but low precision, it correctly classifies most of the positive samples but also produces many false positives (i.e. it classifies many Negative samples as Positive). When a model has high precision but low recall, it is accurate whenever it classifies a sample as Positive, but it detects only a few of the positive samples.
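
As an illustration (my own sketch, using the seven scores and ground-truth labels from the start of this tutorial), sweeping the classification threshold shows how precision and recall trade off against each other:

import sklearn.metrics

scores = [0.6, 0.2, 0.55, 0.9, 0.4, 0.8, 0.5]
y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]

for threshold in [0.3, 0.5, 0.7]:
    # Re-label the samples using the current threshold.
    y_pred = ["positive" if s >= threshold else "negative" for s in scores]
    precision = sklearn.metrics.precision_score(y_true, y_pred, pos_label="positive")
    recall = sklearn.metrics.recall_score(y_true, y_pred, pos_label="positive")
    print(threshold, round(precision, 3), round(recall, 3))

0.3 0.667 1.0
0.5 0.6 0.75
0.7 1.0 0.5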


Here are a few questions to test your understanding:

  1. If the recall is 1.0 and the dataset has 5 positive samples, how many positive samples were correctly classified by the model? (All 5.)
  2. If the recall is 0.3 and the dataset has 30 positive samples, how many positive samples were correctly classified by the model? (0.3*30 = 9 samples.)
  3. If the recall is 0.0 and the dataset has 14 positive samples, how many positive samples were correctly classified by the model? (None.)

Which is more important: precision or recall?

Which metric to use depends on the type of problem being solved. Use recall when the goal is to detect all the positive samples, without caring whether negative samples are misclassified as positive. Use precision when the problem is sensitive to classifying a sample as Positive in general, i.e. when it also matters that Negative samples are not incorrectly classified as Positive.

Imagine being given an image and asked to detect all the cars in it. Which metric would you use? Because the goal is to detect all the cars, use recall. This may misclassify some objects as cars, but it works toward detecting all the target objects.

Now assume you are given a mammography image and asked to decide whether cancer is present. Which metric would you use? Because the problem is sensitive to incorrectly classifying an image as cancerous (i.e. as Positive), we must be very sure before labelling an image as Positive. Thus, precision is the preferred metric.

Final thoughts

This tutorial covered the confusion matrix and how to calculate its four metrics (True/False Positive/Negative) for both binary and multi-class classification problems. We also saw how to compute the confusion matrix in Python using Scikit-learn's metrics module.

Based on these four metrics, we then discussed accuracy, precision, and recall. Each metric was explained through several examples and calculated with the sklearn.metrics module.

Building on the concepts discussed here, the next tutorial will cover the precision-recall curve, average precision, and mean average precision (mAP).

