Evaluating classification models. Accuracy, Precision and Recall.

In this article, I am going to delve into some of the metrics used to measure how well classifiers do their job. After reading it, you will know how to evaluate classification models and understand the differences between the metrics we encounter when evaluating them.

In the image shown above, we can see a classification problem. How can we know if this model is good or bad? Let’s delve into this in the following paragraphs.

The most common metric.

The goal of any classifier is to assign a label to each input according to its characteristics; in other words, a classifier distinguishes the instances belonging to different categories. But how can we measure how well a classifier performs this task?

The first idea that comes to mind is probably to calculate the ratio between the correct predictions and the total number of instances. This is, in fact, the definition of accuracy; formally it is:
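Accuracy = Number of correct predictions / Total number of predictions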

With the equation shown above, we can calculate how well our model performs, right? One issue with this method is that in many scenarios we have unbalanced datasets, that is, the number of instances belonging to each class is different. Let’s see a simple example: suppose we have a classifier that distinguishes malignant tumors from benign tumors. Our imaginary dataset is composed as follows:

  • 10 malignant tumors.
  • 90 benign tumors.

If we have a classification model that assigns the label “benign tumor” to every input, we can calculate the accuracy as:
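Accuracy = 90 correct predictions / 100 instances = 0.90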

We have 90 % accuracy, a really good model, right? In this particular problem, the model says that all malignant tumors are benign, and this can be a big problem. So how can we know whether our model is good or bad?

Positive and Negative labels.

The scenario presented before is a clear example of an unbalanced classification problem, where the dataset has a different number of instances per class. Because of this, we need to use other kinds of metrics to evaluate the model’s performance. But first, we need to establish which label we care about when judging whether a classification is good or bad. To do this, in binary classification problems we use the terms positive and negative: the positive label is the class we want to detect, and the negative label is the remaining class.

According to this definition, for binary classification problems, we can encounter the following outputs.

  • True Positive (TP): A true positive output is an instance from the positive class that was classified as positive. This is one correct prediction.
  • True Negative (TN): A true negative output is an instance from the negative class that was classified as negative. This is one correct prediction.
  • False Positive (FP): A false positive output is an instance from the negative class that was classified as positive. This is one incorrect prediction.
  • False Negative (FN): A false negative output is an instance from the positive class that was classified as negative. This is one incorrect prediction.
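
To make these four outcomes concrete, here is a minimal Python sketch (an illustration, not part of the original article) that counts TP, TN, FP and FN from a list of true labels and a list of predicted labels, using the tumor dataset described earlier with “malignant” as the positive class.

def confusion_counts(y_true, y_pred, positive="malignant"):
    # Count the four possible outcomes of a binary classifier.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

# Toy dataset: 10 malignant and 90 benign tumors,
# evaluated against a model that always predicts "benign".
y_true = ["malignant"] * 10 + ["benign"] * 90
y_pred = ["benign"] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)           # 0 90 0 10
print((tp + tn) / len(y_true))  # accuracy = 0.9, even though every malignant tumor is missed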


Defining Accuracy, Precision and Recall in terms of TP, TN, FP and FN.

Now let’s express accuracy in terms of both Positive and Negative predictions. The equation is:
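Accuracy = (TP + TN) / (TP + TN + FP + FN)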

Precision

Let’s say we want to know how many mistakes the model makes when predicting positive labels. In this scenario, we can measure the ratio between the number of correct positive predictions (True Positives) and the total number of positive predictions. Thus we have the definition of Precision:
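Precision = TP / (TP + FP)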

So, models with high precision make few mistakes when predicting the positive label; that is, the number of False Positive outputs tends to be close to zero. In the extreme scenario where the model is 100 % precise, we can trust that all the instances predicted as positive do, in fact, belong to the positive class. Let’s see a simple example of a model with high precision.

In the image shown above, we can see the simplest model we can build: a line that separates the positive instances (blue) from the negative instances (red). Let’s calculate the precision for this simple model.

  • True Positive: we have 2 positive instances classified as positive, so TP = 2.
  • True Negative: we have 6 negative instances classified as negative, so TN = 6.
  • False Positive: there is no negative instance classified as positive, so in this case FP = 0.
  • False Negative: on the left side we can see 2 positive instances classified as negative, so FN = 2.
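
Plugging these counts into the definition, Precision = TP / (TP + FP) = 2 / (2 + 0) = 1.0, so this model is 100 % precise.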

The model shown above classifies positive instances correctly when it predicts them; however, some positive instances end up classified as negative, so the model cannot identify all the positive instances. We might have a high number of False Negatives.

Recall

But what happens with False Negative outputs? Is a model with high precision able to find all the positive instances? Probably not. So, how can we measure the ability to find all the positive instances? If we want to measure how well the model identifies positive instances, we have to take into account all the positive instances in the model output, which implies considering both the True Positive and False Negative instances. Thus, we can define the Recall metric as:
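Recall = TP / (TP + FN)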

If the model can find all the positive instances, the number of False Negative outputs tends to be close to zero; thus, a model with high recall can identify nearly all the instances belonging to the positive class.

Now, in the image, we have a model with high recall. Let’s calculate both recall and precision for this example.

  • True Positive: in this case we have 4 instances from the positive class classified as positive, so TP = 4.
  • True Negative: we have 2 instances from the negative class classified as negative, so TN = 2.
  • False Positive: on the right side of the line we can find 4 instances from the negative class classified as positive, so FP = 4.
  • False Negative: on the other hand, no instance from the positive class was classified as negative, so FN = 0.
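
Plugging in these counts, Recall = TP / (TP + FN) = 4 / (4 + 0) = 1.0, while Precision = TP / (TP + FP) = 4 / (4 + 4) = 0.5.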

The model can find all the positive instances; nonetheless, it makes some mistakes, classifying several negative instances as positive ones.

Which is better? Presenting the F-Score.

So, how can we know if our model is doing its task well or badly? How can we compare this model with others? To answer these questions, we can use a metric that combines both recall and precision. This metric is called the F1-Score and is defined as:
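F1-Score = 2 × (Precision × Recall) / (Precision + Recall)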

The F1-Score penalizes both low precision and low recall; thus, a model with a high F1-Score will have both high precision and high recall. However, this is not frequent in practice.

We can use the last equation when both recall and precision are equally important, but if we need to give more importance to one specific metric we can use the following equation, which is the general F-Score definition.
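F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall)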

This more general equation uses a positive real factor beta that is chosen such that recall is considered beta times as important as precision. The two most common values for beta are 2, which weighs recall higher than precision, and 0.5, which weighs recall lower than precision.
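
As a quick illustration (a minimal sketch based on the definitions above, not part of the original article), the following Python snippet computes precision, recall and the F-Score family directly from the four counts, using the high-recall example from the previous section (TP = 4, TN = 2, FP = 4, FN = 0).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # beta = 1 gives the F1-Score; beta > 1 weighs recall higher,
    # beta < 1 weighs precision higher.
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# High-recall example from the article: TP = 4, TN = 2, FP = 4, FN = 0.
tp, tn, fp, fn = 4, 2, 4, 0
p = precision(tp, fp)            # 0.5
r = recall(tp, fn)               # 1.0
print(f_beta(p, r))              # F1   ≈ 0.667
print(f_beta(p, r, beta=2))      # F2   ≈ 0.833 (recall weighted higher)
print(f_beta(p, r, beta=0.5))    # F0.5 ≈ 0.556 (precision weighted higher)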

It’s all about context.

For every problem we are dealing with, we need to pay special attention to the problem context. Depending on it, we might prefer a model with high precision rather than one with high recall, or vice versa. For example, let’s consider our tumor classifier again: in this scenario we want to identify all the possible malignant tumors (high recall), even if this implies making some mistakes and classifying some benign tumors as malignant.

Consider this situation: if a patient has a benign tumor that is mistakenly detected as malignant, it is likely that this patient will receive more medical attention, and eventually the error will be detected. Conversely, if we classify a malignant tumor as benign, the patient may not receive further medical attention until the consequences of this mistake become evident, and by then it may be too late to fix the error.

Conclusions

In this post, I talked about the importance of considering metrics beyond the simple definition of accuracy when evaluating models, and we also learned that we have to pay attention to the context. The problem itself can give us clues about which metric is better suited to what we are trying to solve.

However, in real life we can find scenarios where we need to consider more than two classes. These problems are called multi-class classification, and in these cases we don’t have positive or negative instances; nonetheless, we can make some assumptions to deal with this kind of problem.

I am passionate about data science and like to explain how these concepts can be used to solve problems in a simple way. If you have any questions or just want to connect, you can find me on LinkedIn or email me at manuelgilsitio@gmail.com.



Evaluating classification models. Accuracy, Precision and Recall. was originally published in Chatbots Life on Medium.

