Measurement of the accuracy of a binary classification problem

Rahull Trehan
5 min read · Jul 21, 2021

My previous article was about the confusion matrix: we discussed why it is important, how it is read and calculated, what the implications of imbalanced data are, and how recall and precision values come to the rescue when evaluating an imbalanced classification problem.

Starting with an understanding of the following observations:

  1. The accuracy score is not the most appropriate metric for evaluating an imbalanced classification problem, and,
  2. Precision and recall each give us a different insight into the model; neither value on its own evaluates the model as a whole.

Hence we need some other measure to evaluate the model as a whole. Here the F-measure, or F-score, comes to the rescue.

F-score or F-measure

F-score or F-measure, often referred to as the F1 score, is a measure of binary classification accuracy and is calculated as the harmonic mean of the precision and recall values of the test.

F-measure is calculated using the below formula:
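F1 = 2 × (Precision × Recall) / (Precision + Recall)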

One may question why we calculate a harmonic mean rather than an arithmetic mean of the precision and recall values. For a detailed answer I would recommend reading through this very interesting explanation, but the crux is that the harmonic mean takes into account how far apart the two values are: since we are averaging precision and recall, the harmonic mean penalizes the score when there is a large gap between the two.

To understand this better, let us look at how the F-measure (harmonic mean) differs from the arithmetic mean for different precision and recall values:
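Precision = 0.5, Recall = 0.5 → arithmetic mean = 0.50, harmonic mean = 0.50
Precision = 0.7, Recall = 0.3 → arithmetic mean = 0.50, harmonic mean = 0.42
Precision = 0.9, Recall = 0.1 → arithmetic mean = 0.50, harmonic mean = 0.18
Precision = 1.0, Recall = 0.01 → arithmetic mean ≈ 0.51, harmonic mean ≈ 0.02

(Illustrative values, computed with the formula above.)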

What the above comparison shows is that the further apart the precision and recall values are, the lower the harmonic mean falls relative to the arithmetic mean. We will now use this understanding in an example to get more clarity.

Continuing with our previous example of identifying patients who are likely to have cancer, let us assume our model is no longer returning a single outcome for every patient but is predicting some positives as well, so the confusion matrix of our model now looks a bit different.
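For example, with 1,000 patients of whom 10 actually have cancer, a confusion matrix consistent with the scores discussed below would be: true positives = 8, false negatives = 2, false positives = 50, true negatives = 940.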

We can see that the accuracy score is still 94.8%, but because our data is highly imbalanced, let's see how the F-score or F-measure evaluates this model.

Putting the values from our model into the formula to calculate the F-score:
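Using the example confusion matrix above: Precision = 8 / (8 + 50) ≈ 0.1379 and Recall = 8 / (8 + 2) = 0.80, so F1 = 2 × (0.1379 × 0.80) / (0.1379 + 0.80) ≈ 0.2353, i.e. roughly 23.53%.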

This shows how effective the F-score is for such imbalanced data: even though the accuracy score is 94.8%, the F-score for the same model is just 23.53%, which is not good at all.

Let's quickly have a look at how precision, recall, and F-score are calculated in Python:

## Calculating the precision score
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
## Calculating the recall score
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
## Calculating the F1 score (harmonic mean of precision and recall)
from sklearn.metrics import f1_score
score = f1_score(y_true, y_pred)
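As a minimal, self-contained sketch, assuming hypothetical y_true and y_pred arrays built to reproduce the example confusion matrix above (8 true positives, 2 false negatives, 50 false positives, 940 true negatives):

## A runnable example with hypothetical labels
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1] * 10 + [0] * 990)                        ## 10 actual cancer cases out of 1,000 patients
y_pred = np.array([1] * 8 + [0] * 2 + [1] * 50 + [0] * 940)    ## model predictions

print(precision_score(y_true, y_pred))  ## ~0.1379
print(recall_score(y_true, y_pred))     ## 0.8
print(f1_score(y_true, y_pred))         ## ~0.2353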

You might want to argue that an F-score of just 23.53% does not make sense when the model achieved a recall score of 80%, which is very good for our problem (since our intention is to have as few false negative predictions as possible). To handle such scenarios, we need a way to weight the F-measure so that it better reflects what matters for the model at hand. Here I would like to introduce the concept of the F-beta score, and we will see how it helps us adjust the evaluation score of the model.

F-beta score

The idea behind the F-beta score is to adjust the F-score based on the type of problem we have at hand. If we are working on a problem where we want to give more weight to either precision or recall, we can vary the beta value accordingly, and this results in an evaluation score that better reflects our priorities.

F-beta is calculated using the below formula:
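F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall)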

The default value of beta is 1, and if we substitute this into the above formula we recover the formula for the F-measure, i.e. the harmonic mean of precision and recall.

Conceptually, if we want to give more weight to precision and less to recall we lower beta, and on the other hand if we intend to give more weight to recall and less to precision we increase the beta value, as the quick example below illustrates.
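For example, with precision = 0.9 and recall = 0.5 (illustrative values):

beta = 0.5 → F-beta = 1.25 × (0.9 × 0.5) / (0.25 × 0.9 + 0.5) ≈ 0.78 (pulled toward precision)
beta = 1 → F-beta = 2 × (0.9 × 0.5) / (0.9 + 0.5) ≈ 0.64 (the plain F1 score)
beta = 2 → F-beta = 5 × (0.9 × 0.5) / (4 × 0.9 + 0.5) ≈ 0.55 (pulled toward recall)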

Applying the above concept to our cancer prediction problem: since we intend to keep false negative predictions to a minimum, or in other words we want to focus more on the recall value, let's calculate the F-beta score for a beta value of 5, that is, giving more weight to recall.
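Using the same example values as before (precision ≈ 0.1379, recall = 0.80) with beta = 5: F-beta = (1 + 25) × (0.1379 × 0.80) / (25 × 0.1379 + 0.80) ≈ 2.869 / 4.248 ≈ 0.6753, i.e. roughly 67.53%.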

We can observe that the model evaluation score improves from 23.53% to 67.53% because we gave more weight to the recall score, and this is how we tune the F-measure according to the type of problem we have at hand.

Let's quickly have a look at how the F-beta score is calculated in Python:

## Calculating the F-beta score
from sklearn.metrics import fbeta_score
f_half = fbeta_score(y_true, y_pred, beta=0.5)  ## beta < 1 weights precision more
f_five = fbeta_score(y_true, y_pred, beta=5)    ## beta > 1 weights recall more
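With the hypothetical y_true and y_pred arrays from the earlier sketch, fbeta_score(y_true, y_pred, beta=5) returns roughly 0.6753, matching the manual calculation above, while beta=0.5 drops the score to about 0.165 because it leans on the weak precision value.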

Hope this article helped you gain a better understanding of the F-measure (the F1 score, or harmonic mean of precision and recall) and the F-beta score. Thank you so much for reading.
