Averaging methods for F1 score calculation in multi-label classification

There are various evaluation metrics for testing a trained model in a machine learning project.

In particular, the F1 score is one of the most popular metrics for classification tasks, and how to calculate it is well known for binary and multi-class classification.

However, when it comes to multi-label problems, in which the gold answer is not necessarily a single class, I have found it quite confusing to work out which averaging method should be applied to get a proper F1 value.

Therefore, this post is a simple wrap-up of several F1 averaging methods for multi-label classification tasks, including actual implementations and examples using scikit-learn’s default metric functions[1].

Let us begin.

Multi-label classification

First, let’s briefly review what multi-label classification is.

Multi-label classification refers to a problem in which one instance can belong not just to a single class but to several classes at the same time.

Therefore, when we solve this kind of task, we should consider the possibility of membership in every class, not just choose the one class with the highest probability.

To do this, in multi-label problems, both the model predictions and the actual gold labels are represented as lists or vectors that contain a binary value, $0$ or $1$, at each class index.

For example, if we assume that the total number of classes is $3$, then the class indices are $0$, $1$, and $2$.

And if an instance belongs to both class $0$ and class $1$, its label becomes [1, 1, 0].

Likewise, if it belongs to classes $0$ and $2$, the vector becomes [1, 0, 1], and if the instance belongs to all classes, the label can be written as [1, 1, 1].

We can then compare this label with the prediction of a model, which is instructed to produce a vector of the same size (the number of classes) by setting an index to $1$ if the probability at that index is higher than a pre-defined threshold, or leaving it as $0$ otherwise.
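The thresholding step can be sketched in a few lines of plain Python. The probabilities and the threshold of 0.5 are made-up values for illustration, and `to_multi_hot` is a hypothetical helper, not a scikit-learn function.

```python
def to_multi_hot(probs, threshold=0.5):
    """Turn per-class probabilities into a binary vector.

    An index becomes 1 if its probability is higher than the threshold,
    and stays 0 otherwise.
    """
    return [1 if p > threshold else 0 for p in probs]


# Hypothetical per-class probabilities for one instance with 3 classes.
probs = [0.91, 0.62, 0.08]
pred = to_multi_hot(probs)
print(pred)  # [1, 1, 0]
```

Here classes 0 and 1 pass the threshold, so the prediction has the same shape as a label such as [1, 1, 0] and can be compared with it index by index.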

The figure below might help you understand the overall concept.

The description of a multi-label classification problem.

Accuracy

In the above example, how can we apply a proper standard to evaluate the model’s performance?

If we calculate the accuracy score by counting the instances whose label and prediction are exactly the same, the value becomes $0.25$, since only one instance out of the $4$ samples is perfectly correct.

With the accuracy_score function provided by scikit-learn, we can evaluate the model in the same way as above.

from sklearn.metrics import accuracy_score

labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]

print(accuracy_score(labels, preds))  # 0.25

This is simple and intuitive, but is it the right method?

What about a situation like this?

from sklearn.metrics import accuracy_score

labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds_0 = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]
preds_1 = [[0,1,1],[1,1,0],[1,1,0],[1,1,1]]

print(accuracy_score(labels, preds_0))  # 0.25
print(accuracy_score(labels, preds_1))  # 0.25

As you can see, two different models gave two different sets of predictions, preds_0 and preds_1, and both accuracy scores are $0.25$.

However, if you look at the results more closely, preds_1 is a little better, since its predictions for the $3$rd and $4$th instances are closer to the labels than those of preds_0.

Since the latter model is relatively more competent, it is not entirely fair to compare the models with accuracy alone.

Therefore, in multi-label situations, we need a more suitable evaluation standard.
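One way to check that the second set of predictions really is closer is to count, per instance, how many class positions disagree with the label. This is only a sketch in plain Python; `mismatches_per_instance` is a hypothetical helper name.

```python
labels = [[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
preds_0 = [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
preds_1 = [[0, 1, 1], [1, 1, 0], [1, 1, 0], [1, 1, 1]]


def mismatches_per_instance(y_true, y_pred):
    """Count, for each instance, how many class positions disagree."""
    return [
        sum(t != p for t, p in zip(true_row, pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    ]


print(mismatches_per_instance(labels, preds_0))  # [0, 1, 3, 2]
print(mismatches_per_instance(labels, preds_1))  # [0, 1, 1, 1]
```

Both sets of predictions get exactly one instance fully right, which is all that exact-match accuracy sees, but the second one makes fewer mistakes on the 3rd and 4th instances.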

Micro? Macro? Samples? Weighted?

The F1 score is another metric commonly used to evaluate classification tasks.

The score itself is already well known, so I will omit its details.

We are going to look at different averaging methods for the F1 score in the multi-label classification setting.

Micro average: When we count TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative), we do not separate the classes; every prediction, at every class index of every instance, is pooled into a single set of counts.
Macro average: We first calculate the score of each class and then take the unweighted average of those per-class scores.
Samples average: (In multi-label classification) We first calculate the score of each instance and then take the average over all instances.
Weighted average: This is the same as macro average, except that each class can carry a different weight when we take the average. In scikit-learn, the weight is the support of the class, i.e. the number of true instances with that label.

Actually, these are also well known to people who are interested in machine learning.

But in multi-label circumstances, some might be confused about how to apply these averaging approaches.

Let’s take a look at them step by step with examples. (I will not cover the weighted average separately, since it differs from the macro average only in the class weights.)
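For reference, here is a small sketch of how a weighted average differs from a macro average on the running example. It assumes that the weight of each class is its support (the number of positive labels for that class), which is how scikit-learn documents average='weighted'. The helper `f1_from_columns` is a hypothetical name, and it assumes every class has at least one positive label or prediction.

```python
from fractions import Fraction

labels = [[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
preds = [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]


def f1_from_columns(true_col, pred_col):
    """F1 of one class, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_col, pred_col) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 0)
    return Fraction(2 * tp, 2 * tp + fp + fn)


# Transpose so that each entry holds the values of one class.
true_cols = list(zip(*labels))
pred_cols = list(zip(*preds))

per_class_f1 = [f1_from_columns(t, p) for t, p in zip(true_cols, pred_cols)]
support = [sum(col) for col in true_cols]  # positive labels per class

macro_f1 = sum(per_class_f1) / len(per_class_f1)
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / sum(support)

print(support)                # [2, 3, 3]
print(macro_f1, weighted_f1)  # 11/18 5/8
```

The per-class scores are identical in both cases; only the weights applied in the final mean change.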

Micro average

Back to the previous example, let’s calculate F1 score with micro average.

First, the confusion matrix is represented as follows.

The confusion matrix in micro average.

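As you can see, the matrix values are filled in without separating the classes: every class index of every instance contributes to the same four counts (TP $=5$, FP $=3$, FN $=3$, TN $=1$).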
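The pooled counts can be reproduced with a short loop over every (instance, class) cell. This is a plain-Python sketch for checking the numbers by hand, not how scikit-learn computes them internally.

```python
labels = [[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
preds = [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]

# One shared set of counters for all classes and all instances.
counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for true_row, pred_row in zip(labels, preds):
    for t, p in zip(true_row, pred_row):
        if t == 1 and p == 1:
            counts["TP"] += 1
        elif t == 0 and p == 1:
            counts["FP"] += 1
        elif t == 1 and p == 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1

print(counts)  # {'TP': 5, 'FP': 3, 'FN': 3, 'TN': 1}
```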

Therefore, we can get precision, recall, and the F1 score as below.

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{5}{5+3}=\frac{5}{8}$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{5}{5+3}=\frac{5}{8}$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{5}{8}$

And we can also get the same results with the actual implementation.

from sklearn.metrics import f1_score, precision_score, recall_score

labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]

# average='micro' pools TP/FP/FN over all classes before computing each score
print(precision_score(labels, preds, average='micro'))  # 0.625
print(recall_score(labels, preds, average='micro'))  # 0.625
print(f1_score(labels, preds, average='micro'))  # 0.625

Macro average

Next is macro average.

As above, we can construct confusion matrices of each class as follows.

The confusion matrix in macro average.

This time, there is one confusion matrix per class, used to calculate that class’s score.

If you look at the values, you can see that each matrix counts only the values at its own class index and ignores the values at the other class indices.
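The per-class counts can be checked by transposing the label and prediction lists, so that each column holds the values of one class. This is a plain-Python sketch, and `count_cells` is a hypothetical helper name.

```python
labels = [[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
preds = [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]


def count_cells(true_values, pred_values):
    """Return (TP, FP, FN, TN) for two aligned sequences of 0/1 values."""
    pairs = list(zip(true_values, pred_values))
    return (
        pairs.count((1, 1)),  # TP
        pairs.count((0, 1)),  # FP
        pairs.count((1, 0)),  # FN
        pairs.count((0, 0)),  # TN
    )


# zip(*rows) transposes the lists: one tuple per class index.
per_class = [count_cells(t, p) for t, p in zip(zip(*labels), zip(*preds))]

for class_index, cells in enumerate(per_class):
    print(class_index, cells)
# 0 (1, 1, 1, 1)
# 1 (2, 1, 1, 0)
# 2 (2, 1, 1, 0)
```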

Now, let’s get the scores.

Class $0$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{1}{1+1}=\frac{1}{2}$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{1}{1+1}=\frac{1}{2}$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{1}{2}$

Class $1$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+1}=\frac{2}{3}$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+1}=\frac{2}{3}$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$

Class $2$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+1}=\frac{2}{3}$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+1}=\frac{2}{3}$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$

Finally, by averaging the per-class scores, we get the precision, recall, and F1 score under the macro average.

$\text{Precision}=\frac{\frac{1}{2} + \frac{2}{3} + \frac{2}{3}}{3}=\frac{11}{18}$

$\text{Recall}=\frac{\frac{1}{2} + \frac{2}{3} + \frac{2}{3}}{3}=\frac{11}{18}$

$\text{F1}=\frac{\frac{1}{2} + \frac{2}{3} + \frac{2}{3}}{3}=\frac{11}{18}$
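The arithmetic above can be verified with exact fractions. The (TP, FP, FN) triples are copied from the per-class calculation in this section; the variable names are only illustrative.

```python
from fractions import Fraction

# (TP, FP, FN) for class 0, class 1, and class 2, from the worked example.
class_counts = [(1, 1, 1), (2, 1, 1), (2, 1, 1)]

precisions = [Fraction(tp, tp + fp) for tp, fp, fn in class_counts]
recalls = [Fraction(tp, tp + fn) for tp, fp, fn in class_counts]
f1s = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]


def mean(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


macro_precision = mean(precisions)
macro_recall = mean(recalls)
macro_f1 = mean(f1s)  # mean of the per-class F1 scores

print(macro_precision, macro_recall, macro_f1)  # 11/18 11/18 11/18
```

Note that the macro F1 here is the mean of the per-class F1 scores. It matches the macro precision and recall in this example only because precision and recall happen to be equal within every class.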

Also, we can get the same results if we run the code below.

from sklearn.metrics import f1_score, precision_score, recall_score

labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]

# average='macro' scores each class separately, then takes the unweighted mean
print(precision_score(labels, preds, average='macro'))  # 0.611111111111111
print(recall_score(labels, preds, average='macro'))  # 0.611111111111111
print(f1_score(labels, preds, average='macro'))  # 0.611111111111111

Samples average

The last one is the samples average.

If you are not familiar with multi-label classification, you might never have heard of it.

As I mentioned above, the samples average first calculates the score of each sample and then takes the average of those scores.

So it is defined only in the multi-label setting, where each sample has its own set of classes.

Let’s look at the example again.

The confusion matrix in samples average.

This time, we need a confusion matrix for each instance, and as you can see above, its values come only from the class entries inside that sample.
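The per-instance counts can be reproduced by comparing each label row with its prediction row, without looking at any other instance. This is again a plain-Python sketch for checking the numbers by hand.

```python
labels = [[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
preds = [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]

per_instance = []
for true_row, pred_row in zip(labels, preds):
    # Pair up (label, prediction) at every class index of this one instance.
    pairs = list(zip(true_row, pred_row))
    per_instance.append({
        "TP": pairs.count((1, 1)),
        "FP": pairs.count((0, 1)),
        "FN": pairs.count((1, 0)),
        "TN": pairs.count((0, 0)),
    })

for i, counts in enumerate(per_instance):
    print(f"instance {i}: {counts}")
# instance 0: {'TP': 2, 'FP': 0, 'FN': 0, 'TN': 1}
# instance 1: {'TP': 2, 'FP': 0, 'FN': 1, 'TN': 0}
# instance 2: {'TP': 0, 'FP': 2, 'FN': 1, 'TN': 0}
# instance 3: {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 0}
```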

Instance $0$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+0}=1$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+0}=1$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=1$

Instance $1$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+0}=1$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+1}=\frac{2}{3}$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{4}{5}$

Instance $2$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{0}{0+2}=0$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{0}{0+1}=0$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=0$ (taken as $0$, since $\text{TP}=0$ makes both precision and recall $0$)

Instance $3$

$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{1}{1+1}=\frac{1}{2}$

$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{1}{1+1}=\frac{1}{2}$

$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{1}{2}$

And we get the desired output by averaging the per-sample scores.

$\text{Precision}=\frac{1 + 1 + 0 + \frac{1}{2}}{4}=\frac{5}{8}$

$\text{Recall}=\frac{1 + \frac{2}{3} + 0 + \frac{1}{2}}{4}=\frac{13}{24}$

$\text{F1}=\frac{1 + \frac{4}{5} + 0 + \frac{1}{2}}{4}=\frac{23}{40}$
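These averages can be double-checked with exact fractions. The (TP, FP, FN) triples are copied from the per-instance calculation in this section. The F1 score is written as 2TP / (2TP + FP + FN), which equals the harmonic mean of precision and recall whenever that is defined and gives 0 for the instance with no true positives.

```python
from fractions import Fraction

# (TP, FP, FN) for instances 0 to 3, from the worked example.
instance_counts = [(2, 0, 0), (2, 0, 1), (0, 2, 1), (1, 1, 1)]

precisions = [Fraction(tp, tp + fp) for tp, fp, fn in instance_counts]
recalls = [Fraction(tp, tp + fn) for tp, fp, fn in instance_counts]
f1s = [Fraction(2 * tp, 2 * tp + fp + fn) for tp, fp, fn in instance_counts]

n = len(instance_counts)
samples_precision = sum(precisions) / n
samples_recall = sum(recalls) / n
samples_f1 = sum(f1s) / n

print(samples_precision, samples_recall, samples_f1)  # 5/8 13/24 23/40
```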

Again, we can see the same results with the default functions in scikit-learn.

from sklearn.metrics import f1_score, precision_score, recall_score

labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]

# average='samples' scores each instance separately, then averages over instances
print(precision_score(labels, preds, average='samples'))  # 0.625
print(recall_score(labels, preds, average='samples'))  # 0.5416666666666666
print(f1_score(labels, preds, average='samples'))  # 0.575

Additional discussions

So far, we’ve been through different averaging methods for the F1 score in multi-label classification.

It is hard to conclude which metric is the most preferable, since the proper method depends on which aspect one wants to focus on.

But I personally use the “samples average” together with the “exact accuracy” we saw at the beginning when handling multi-label challenges.

Although micro and macro averages each have their advantages in multi-class classification, they assess the model based on classes rather than instances, so they can be misleading when the model is biased against a certain class, which causes the score to vary.

In particular, macro average is the most vulnerable to this phenomenon, because every class contributes equally to the mean regardless of how often it occurs, so the score can fluctuate strongly if the performance on a single class changes.
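A small sketch can illustrate this sensitivity. The data below is entirely made up for the illustration: 10 instances and 2 classes, where class 1 is positive for only one instance. The helpers `f1`, `micro_f1`, and `macro_f1` are hypothetical plain-Python functions written for this sketch, not scikit-learn's.

```python
from fractions import Fraction


def f1(true_values, pred_values):
    """F1 as 2TP / (2TP + FP + FN); returns 0 if there is nothing to count."""
    pairs = list(zip(true_values, pred_values))
    tp = pairs.count((1, 1))
    fp = pairs.count((0, 1))
    fn = pairs.count((1, 0))
    denom = 2 * tp + fp + fn
    return Fraction(2 * tp, denom) if denom else Fraction(0)


def micro_f1(y_true, y_pred):
    """Pool every (instance, class) cell, then compute one F1."""
    flat_true = [t for row in y_true for t in row]
    flat_pred = [p for row in y_pred for p in row]
    return f1(flat_true, flat_pred)


def macro_f1(y_true, y_pred):
    """Compute F1 per class (column), then take the unweighted mean."""
    scores = [f1(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
    return sum(scores) / len(scores)


# Hypothetical data: class 0 is always present, class 1 appears only once.
labels = [[1, 0]] * 9 + [[1, 1]]
perfect = [row[:] for row in labels]  # every cell predicted correctly
one_miss = [[1, 0]] * 10              # the single class-1 positive is missed

print(micro_f1(labels, perfect), macro_f1(labels, perfect))    # 1 1
print(micro_f1(labels, one_miss), macro_f1(labels, one_miss))  # 20/21 1/2
```

In this toy case a single wrong cell lowers the micro average from 1 to 20/21 but halves the macro average, because the rare class carries half of the macro mean.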

And in multi-label classification, I think we should focus on how well the model predicts the group of classes for each instance, so a standard that concentrates on each instance rather than each class is more attractive.

Therefore, I prefer to use two instance-based methods: the soft/averaged one (samples-average F1), with the hard/strict one (accuracy) as a support.

Of course, this can vary from person to person.

I have added several more examples for those who want more practice.

Calculate the scores yourself and compare them with the actual results.

from sklearn.metrics import f1_score

# Example 1
labels = [[1,0,1],[1,1,1],[1,1,0],[0,1,1],[0,0,1]]
preds = [[1,1,1],[0,0,1],[1,1,1],[1,1,0],[0,0,1]]

print(f1_score(labels, preds, average='micro'))  # 0.7
print(f1_score(labels, preds, average='macro'))  # 0.6944444444444443
print(f1_score(labels, preds, average='samples'))  # 0.72


# Example 2
labels = [[1,0,0,0],[1,1,0,1],[0,1,0,0],[1,1,1,1]]
preds = [[1,1,0,0],[0,0,1,1],[1,0,1,0],[1,1,1,1]]

print(f1_score(labels, preds, average='micro'))  # 0.631578947368421
print(f1_score(labels, preds, average='macro'))  # 0.6416666666666666
print(f1_score(labels, preds, average='samples'))  # 0.5166666666666666


# Example 3
labels = [[1,0,0,1],[0,0,0,1],[1,1,0,0],[1,1,1,1],[1,1,1,0]]
preds = [[1,0,0,1],[0,1,1,1],[0,0,1,1],[0,1,1,0],[1,1,0,0]]

print(f1_score(labels, preds, average='micro'))  # 0.6086956521739131
print(f1_score(labels, preds, average='macro'))  # 0.6
print(f1_score(labels, preds, average='samples'))  # 0.5933333333333334

So this is the end of the post about various averaging methods for the F1 score in multi-label classification tasks.

If you have any opinions or find any errors in the post, please leave a comment below.

Thank you.

[1] 3.3 Metrics and scoring: quantifying the quality of predictions. https://scikit-learn.org/stable/modules/model_evaluation.html.
