There are various evaluation metrics for testing a trained model in machine learning projects.
In particular, the F1 score is one of the most popular metrics for classification tasks, and how to calculate it is well-known for binary and multiclass classification.
However, when it comes to multilabel problems, in which the golden answer is not necessarily a single class, I have found it quite confusing to decide which averaging method should be applied to get a proper F1 value.
Therefore, this post is a simple wrap-up of several F1 averaging methods for multilabel classification tasks, including actual implementations and examples using scikit-learn’s default metric functions[1].
Let us begin.
Multilabel classification
First, let’s briefly review what multilabel classification is.
Multilabel classification refers to a problem in which one instance can belong not only to a single class but also to multiple classes at the same time.
Therefore, when we solve this kind of task, we should consider the possibility of the instance being in each class, not just choose the one class with the highest probability.
To do this, in multilabel problems, both model predictions and actual golden labels are represented as lists or vectors which contain a binary value, $0$ or $1$, at each class index.
For example, if we assume that the total number of classes is $3$, then each class index can be $0$, $1$, or $2$.
And if an instance is in both class $0$ and class $1$, its label becomes [1, 1, 0].
Likewise, if it is in both classes $0$ and $2$, the vector becomes [1, 0, 1], and if the instance is in all classes, the label can be written as [1, 1, 1].
With this, we can compare the label with the prediction from a model, which is instructed to make a vector of the same size (the number of classes) by setting an index to $1$ if the probability at this index is higher than a predefined threshold, or leaving it as $0$ otherwise.
The description below might help you to understand the overall concept.
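As a minimal sketch of the thresholding step, the snippet below turns per-class probabilities into a binary prediction vector. The probability values and the threshold of 0.5 are hypothetical numbers chosen for illustration, not the output of a real model.

```python
import numpy as np

# Hypothetical per-class probabilities for 4 instances and 3 classes
# (made-up numbers for illustration only).
probs = np.array([
    [0.10, 0.80, 0.90],
    [0.70, 0.60, 0.30],
    [0.55, 0.20, 0.65],
    [0.40, 0.75, 0.85],
])
threshold = 0.5  # assumed cut-off; in practice this is a tunable choice

# Set an index to 1 when its probability exceeds the threshold, else 0.
preds = (probs > threshold).astype(int)
print(preds.tolist())
# [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
```

Each class is decided independently here, so an instance can end up with several active classes, or even none.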
Accuracy
In the above example, what standard should we apply to evaluate the model’s performance?
If we calculate the accuracy score by counting the instances whose label and prediction are exactly the same, the value becomes $0.25$, since only one instance among the $4$ samples is perfectly correct.
With the accuracy_score function provided by scikit-learn, we can evaluate the model in the same way.
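from sklearn.metrics import accuracy_score
# Golden labels and model predictions, one row per instance
labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]
# For multilabel input, accuracy_score counts exact matches only
print(accuracy_score(labels, preds)) # 0.25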
This is simple and intuitive, but is it the right method?
What about a situation like this?
from sklearn.metrics import accuracy_score
labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
# Predictions from two different models
preds_0 = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]
preds_1 = [[0,1,1],[1,1,0],[1,1,0],[1,1,1]]
print(accuracy_score(labels, preds_0)) # 0.25
print(accuracy_score(labels, preds_1)) # 0.25
As you can see, two different models gave two different predictions, preds_0 and preds_1, and both accuracy scores are $0.25$.
However, if you look at the results more closely, preds_1 is a little better, since it predicted the $3$rd and $4$th instances closer to the labels than preds_0 did.
Since the latter model is relatively more competent, it is not entirely fair to compare the models with accuracy alone.
Therefore, in multilabel situations, we need a more suitable evaluation standard.
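To make the gap between the two models concrete, here is a small sketch that computes the exact-match accuracy by hand and also counts how many label slots each model gets right per instance. The helper names exact_match and correct_slots are hypothetical, introduced only for this illustration.

```python
import numpy as np

labels = np.array([[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])
preds_0 = np.array([[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]])
preds_1 = np.array([[0, 1, 1], [1, 1, 0], [1, 1, 0], [1, 1, 1]])

def exact_match(y_true, y_pred):
    """Fraction of instances whose whole label vector is predicted correctly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def correct_slots(y_true, y_pred):
    """Number of correctly predicted label slots for each instance."""
    return np.sum(y_true == y_pred, axis=1).tolist()

print(exact_match(labels, preds_0), exact_match(labels, preds_1))  # 0.25 0.25
print(correct_slots(labels, preds_0))  # [3, 2, 0, 1]
print(correct_slots(labels, preds_1))  # [3, 2, 2, 2]
```

Both models tie on exact matches, but the second one gets more individual slots right on the last two instances, which is exactly the information that strict accuracy throws away.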
Micro? Macro? Samples? Weighted?
The F1 score is another metric commonly used for classification tasks.
The score itself is already well-known, so I will omit its details.
We are going to look at different averaging methods for the F1 score in the multilabel classification situation.
 Micro average: When we count TP(True Positive), TN(True Negative), FP(False Positive), and FN(False Negative), we do not separate the classes; we pool every prediction and only check whether it is right or wrong.
 Macro average: We calculate the score of each class separately and then take the unweighted average of the class scores at the end.
 Samples average: (In multilabel classification) We first calculate the score of each instance and then take the average over all instances at the end.
 Weighted average: This is the same as macro average, except that each class score is weighted when we take the average. In scikit-learn, the weight is the support, i.e. the number of true instances of each class.
Actually, these are also well-known to people who are interested in machine learning.
But in multilabel circumstances, some might be confused about how to apply these averaging approaches.
Let’s take a look at them step by step with examples. (I will not cover weighted average separately since it differs from macro average only in the class weights.)
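Since weighted average is not walked through in this post, here is a brief sketch of how it relates to macro average, using the same labels and predictions as the accuracy example. It assumes scikit-learn's behavior of weighting each class by its support (the number of true instances of that class).

```python
import numpy as np
from sklearn.metrics import f1_score

labels = np.array([[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])
preds = np.array([[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]])

# average=None returns one F1 value per class instead of a single number.
per_class = f1_score(labels, preds, average=None)
support = labels.sum(axis=0)  # number of true instances per class

macro = per_class.mean()                            # every class counts equally
weighted = np.average(per_class, weights=support)   # classes weighted by support

print(macro, f1_score(labels, preds, average='macro'))        # both about 0.6111
print(weighted, f1_score(labels, preds, average='weighted'))  # both about 0.625
```

The two averages differ here only because class $0$ has fewer true instances than the other two classes.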
Micro average
Going back to the previous example, let’s calculate the F1 score with micro average.
First, the confusion matrix is built over all $12$ label slots ($4$ instances $\times$ $3$ classes) at once, which gives $\text{TP}=5$, $\text{FP}=3$, $\text{FN}=3$ and $\text{TN}=1$.
As you can see, the values are counted without separating the classes.
Therefore, we can get precision, recall and F1 score as below.
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{5}{5+3}=\frac{5}{8}$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{5}{5+3}=\frac{5}{8}$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{5}{8}$
And we can also get the same results with an actual implementation.
from sklearn.metrics import f1_score, precision_score, recall_score
labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]
# average='micro' pools TP, FP and FN over all classes
print(precision_score(labels, preds, average='micro')) # 0.625
print(recall_score(labels, preds, average='micro')) # 0.625
print(f1_score(labels, preds, average='micro')) # 0.625
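For readers who want to see the pooled counting spelled out, the following is a minimal NumPy sketch of the micro average computed by hand. It is an illustration of the idea only, not how scikit-learn implements it internally.

```python
import numpy as np

labels = np.array([[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])
preds = np.array([[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]])

# Pool every (instance, class) slot together before counting.
tp = int(np.sum((labels == 1) & (preds == 1)))
fp = int(np.sum((labels == 0) & (preds == 1)))
fn = int(np.sum((labels == 1) & (preds == 0)))
tn = int(np.sum((labels == 0) & (preds == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)         # 5 3 3 1
print(precision, recall, f1)  # 0.625 0.625 0.625
```

Note that TN never enters precision, recall or F1, so it is counted here only for completeness.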
Macro average
Next is macro average.
This time, we construct a separate confusion matrix for each class.
Each confusion matrix is used for calculating the score of its own class.
That is, the values are counted only at one class index, excluding the values at the other class indices.
Now, let’s get the scores.

Class $0$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{1}{1+1}=\frac{1}{2}$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{1}{1+1}=\frac{1}{2}$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{1}{2}$

Class $1$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+1}=\frac{2}{3}$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+1}=\frac{2}{3}$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$

Class $2$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+1}=\frac{2}{3}$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+1}=\frac{2}{3}$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$
Finally, by averaging the per-class scores, we get the precision, recall and F1 score using macro average.
$\text{Precision}=\frac{\frac{1}{2} + \frac{2}{3} + \frac{2}{3}}{3}=\frac{11}{18}$
$\text{Recall}=\frac{\frac{1}{2} + \frac{2}{3} + \frac{2}{3}}{3}=\frac{11}{18}$
$\text{F1}=\frac{\frac{1}{2} + \frac{2}{3} + \frac{2}{3}}{3}=\frac{11}{18}$
Also, we can get the same results if we run the code below.
from sklearn.metrics import f1_score, precision_score, recall_score
labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]
# average='macro' scores each class first, then averages the class scores
print(precision_score(labels, preds, average='macro')) # 0.611111111111111
print(recall_score(labels, preds, average='macro')) # 0.611111111111111
print(f1_score(labels, preds, average='macro')) # 0.611111111111111
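The per-class confusion matrices can also be obtained programmatically. The sketch below uses scikit-learn's multilabel_confusion_matrix, which returns one $2 \times 2$ matrix per class laid out as [[TN, FP], [FN, TP]], and computes F1 in the equivalent form $\frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}}$. It assumes every class has a non-zero denominator, which holds for this example.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

labels = np.array([[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])
preds = np.array([[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]])

# Shape (n_classes, 2, 2); each matrix is [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(labels, preds)
tp, fp, fn = mcm[:, 1, 1], mcm[:, 0, 1], mcm[:, 1, 0]

# Per-class F1, then the unweighted mean over classes (macro average).
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

print(tp.tolist(), fp.tolist(), fn.tolist())  # [1, 2, 2] [1, 1, 1] [1, 1, 1]
print(per_class_f1)  # approximately 0.5, 0.667, 0.667
print(macro_f1)      # approximately 0.6111 (= 11/18)
```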
Samples average
The last one is samples average.
If you are not familiar with multilabel classification, you might have never heard of it.
As I mentioned above, the samples average method calculates the score of each sample first and then takes the average of these per-sample scores.
So it can be defined only in multilabel situations, where each sample has its own class distribution.
Let’s look at the example again.
This time, we need a confusion matrix for each instance, whose values are counted only from the class distribution inside that sample.

Instance $0$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+0}=1$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+0}=1$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=1$

Instance $1$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{2}{2+0}=1$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{2}{2+1}=\frac{2}{3}$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{4}{5}$

Instance $2$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{0}{0+2}=0$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{0}{0+1}=0$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=0$

Instance $3$
$\text{Precision}=\frac{\text{TP}}{\text{TP} + \text{FP}}=\frac{1}{1+1}=\frac{1}{2}$
$\text{Recall}=\frac{\text{TP}}{\text{TP} + \text{FN}}=\frac{1}{1+1}=\frac{1}{2}$
$\text{F1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{1}{2}$
And we can get the desired output with the average on sample scores.
$\text{Precision}=\frac{1 + 1 + 0 + \frac{1}{2}}{4}=\frac{5}{8}$
$\text{Recall}=\frac{1 + \frac{2}{3} + 0 + \frac{1}{2}}{4}=\frac{13}{24}$
$\text{F1}=\frac{1 + \frac{4}{5} + 0 + \frac{1}{2}}{4}=\frac{23}{40}$
Again, we can see the same results with the default functions in scikit-learn.
from sklearn.metrics import f1_score, precision_score, recall_score
labels = [[0,1,1],[1,1,1],[0,1,0],[1,0,1]]
preds = [[0,1,1],[1,1,0],[1,0,1],[0,1,1]]
# average='samples' scores each instance first, then averages over instances
print(precision_score(labels, preds, average='samples')) # 0.625
print(recall_score(labels, preds, average='samples')) # 0.5416666666666666
print(f1_score(labels, preds, average='samples')) # 0.575
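The same per-instance calculation can be sketched by hand with NumPy, counting along each row instead of each column. This sketch assumes that every instance has at least one true label and at least one predicted label, which is true for this example; otherwise the divisions below would need zero handling.

```python
import numpy as np

labels = np.array([[0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]])
preds = np.array([[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]])

# Count along axis=1, i.e. inside each instance.
tp = np.sum((labels == 1) & (preds == 1), axis=1)
n_pred = preds.sum(axis=1)   # TP + FP per instance
n_true = labels.sum(axis=1)  # TP + FN per instance

precision = tp / n_pred
recall = tp / n_true
# F1 = 2TP / (2TP + FP + FN), which equals 2TP / (n_pred + n_true).
f1 = 2 * tp / (n_pred + n_true)

print(precision.mean())  # 0.625
print(recall.mean())     # approximately 0.5417 (= 13/24)
print(f1.mean())         # approximately 0.575 (= 23/40)
```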
Additional discussions
So far, we’ve been through different averaging methods for the F1 score in multilabel classification.
It is hard to conclude which metric is the most preferable, since the proper method depends on which aspect one wants to focus on.
But I personally use the “samples average” and the exact-match accuracy we saw at the beginning when handling multilabel challenges.
Although micro and macro averages each have advantages in multiclass classification, they assess the model based on classes, not on instances, so they might confuse us when the model is biased against a certain class, resulting in score variation.
In particular, macro average is the most vulnerable to this phenomenon, so it can fluctuate heavily if the performance on one class changes.
And in multilabel classification, I think we should focus on how the model predicts the group of classes for each instance, so a standard which concentrates on each instance, not each class, is more attractive.
Therefore, I prefer to use two instance-based methods: the soft/averaged one (samples average F1) and, as a support, the hard/strict one (accuracy).
Of course, this can vary from person to person.
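To illustrate the point about macro average reacting strongly to a single class, here is a small sketch on hypothetical data in which class $2$ is rare. Model a is perfect, while model b differs only on class $2$. The data are made up for this illustration and are not taken from the examples above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: class 2 has a single positive instance.
labels = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 0],
                   [0, 1, 0], [1, 1, 0], [0, 1, 1]])

preds_a = labels.copy()  # model a: every slot correct
preds_b = labels.copy()  # model b: wrong only on class 2
preds_b[5, 2] = 0        # misses the only true positive of class 2
preds_b[0, 2] = 1        # and predicts class 2 on a wrong instance

scores = {}
for name, preds in [('a', preds_a), ('b', preds_b)]:
    scores[name] = {
        avg: f1_score(labels, preds, average=avg)
        for avg in ('micro', 'macro', 'samples')
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})
```

With these made-up numbers, only two of the eighteen slots change, yet the macro F1 of model b falls to about $0.667$ because the F1 of class $2$ becomes $0$, while the micro and samples averages stay at about $0.9$ and $0.911$.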
I added several more examples for those who want to practice more.
Calculate the scores and compare them with the actual results.
from sklearn.metrics import f1_score
# Example 1
labels = [[1,0,1],[1,1,1],[1,1,0],[0,1,1],[0,0,1]]
preds = [[1,1,1],[0,0,1],[1,1,1],[1,1,0],[0,0,1]]
print(f1_score(labels, preds, average='micro')) # 0.7
print(f1_score(labels, preds, average='macro')) # 0.6944444444444443
print(f1_score(labels, preds, average='samples')) # 0.72
# Example 2
labels = [[1,0,0,0],[1,1,0,1],[0,1,0,0],[1,1,1,1]]
preds = [[1,1,0,0],[0,0,1,1],[1,0,1,0],[1,1,1,1]]
print(f1_score(labels, preds, average='micro')) # 0.631578947368421
print(f1_score(labels, preds, average='macro')) # 0.6416666666666666
print(f1_score(labels, preds, average='samples')) # 0.5166666666666666
# Example 3
labels = [[1,0,0,1],[0,0,0,1],[1,1,0,0],[1,1,1,1],[1,1,1,0]]
preds = [[1,0,0,1],[0,1,1,1],[0,0,1,1],[0,1,1,0],[1,1,0,0]]
print(f1_score(labels, preds, average='micro')) # 0.6086956521739131
print(f1_score(labels, preds, average='macro')) # 0.6
print(f1_score(labels, preds, average='samples')) # 0.5933333333333334
So this is the end of the post about various averaging methods for the F1 score in multilabel classification tasks.
If there is any opinion or error in the post, please leave a comment below.
Thank you.