What is F1 Score in machine learning?

In machine learning and data science, evaluating a model is as important as building it. Accuracy is often the first metric people reach for, but it can be misleading when the data is imbalanced. For this reason, metrics such as precision, recall, and the F1 score are widely used. This article focuses on the F1 score: what it is, why it matters, how it is calculated, and when it should be used. It also includes a working Python example using scikit-learn and discusses common mistakes to avoid during model evaluation.
What is the F1 Score in Machine Learning?
The F1 score, also known as the F-score or F-measure, is a metric that evaluates a model by combining precision and recall into a single value. It is often used in classification problems, especially when the data is imbalanced or when both false positives and false negatives matter.
Precision measures how many of the predicted positive cases are actually correct. In simple terms, it answers the question: of all the cases predicted as positive, how many are truly positive? Recall, also called sensitivity, measures how many of the actual positive cases the model correctly identifies. It answers the question: of all the real positive cases, how many did the model find?
Precision and recall often trade off against each other: improving one can reduce the other. The F1 score addresses this by using the harmonic mean, which gives more weight to lower values. As a result, the F1 score is high only when both precision and recall are high.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1 (or 0 to 100%). A score of 1 indicates perfect precision and recall. A score of 0 indicates that precision, recall, or both are zero. This makes the F1 score a reliable metric for evaluating classification models.
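As a quick sanity check, here is a minimal Python sketch of this formula (the helper name f1_from_precision_recall is just an illustrative choice, not part of any library):

def f1_from_precision_recall(precision, recall):
    # Harmonic mean of precision and recall; returns 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_from_precision_recall(1.0, 1.0))  # 1.0 -> perfect precision and recall
print(f1_from_precision_recall(0.9, 0.1))  # ~0.18 -> pulled down by the low recall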
Also Read: 8 Ways to Improve the Accuracy of Machine Learning Models
When Should You Use F1 Score?
The F1 score is used when accuracy alone cannot provide a clear picture of model performance. This happens most often with imbalanced data. In such cases, a model can achieve high accuracy simply by predicting the majority class while failing to identify the minority class at all. The F1 score helps expose this problem because it takes both precision and recall into account.
The F1 score also comes in handy when both false positives and false negatives carry significant costs. It gives a single value that reflects how well the model balances these two types of errors. To achieve a high F1 score, a model must perform well on both precision and recall, which makes it more reliable than accuracy for many real-world tasks.
Real World Use Cases for F1 Score
The F1 score is often used in the following situations:
- Imbalanced classification problems such as spam filtering, fraud detection, and medical diagnosis.
- Information retrieval and search systems, where relevant results must be returned with a minimum number of false positives.
- Model tuning or threshold selection, where both precision and recall are important.
If one type of error is more costly than the other, the F1 score should not be used on its own. Recall deserves the most attention when missing a positive case is very costly, while precision matters most when false alarms are the bigger problem. If precision and recall are equally important, the F1 score is the most appropriate choice.
How to Calculate F1 Score Step by Step
The F1 score can be calculated once precision and recall are known. Both metrics are derived from the confusion matrix of a binary classification problem.
Precision measures how many of the predicted positive cases are actually correct. It is defined as:
Precision = TP / (TP + FP)
Recall measures how many of the actual positive cases are correctly identified. It is defined as:
Recall = TP / (TP + FN)
Here, TP represents true positives, FP represents false positives, and FN represents false negatives.
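To make these definitions concrete, the following sketch derives precision and recall from raw confusion-matrix counts; the specific values of TP, FP, and FN are made up for illustration.

# Hypothetical confusion-matrix counts for a binary classifier
TP = 6  # true positives
FP = 2  # false positives
FN = 4  # false negatives

precision = TP / (TP + FP)  # 6 / 8 = 0.75
recall = TP / (TP + FN)     # 6 / 10 = 0.60

print("Precision:", precision)
print("Recall:", recall)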
F1 Score Formula Using Precision and Recall
Once precision (P) and recall (R) are known, the F1 score is the harmonic mean of the two:
F1 = (2 × P × R) / (P + R)
The harmonic mean gives more weight to smaller values. As a result, the F1 score is dragged down by whichever of precision or recall is lower. For example, if precision is 0.90 and recall is 0.10, the F1 score is approximately 0.18. If both precision and recall are 0.50, the F1 score is also 0.50.
This ensures that high F1 scores are obtained only when both precision and recall are high.
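The short sketch below contrasts the harmonic mean used by the F1 score with a plain arithmetic average, using the precision and recall values from the example above:

def harmonic_mean(p, r):
    # Harmonic mean of two values; 0 if both are 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def arithmetic_mean(p, r):
    return (p + r) / 2

# Precision 0.90, recall 0.10: the arithmetic mean looks fine, the F1 score does not
print(arithmetic_mean(0.90, 0.10))  # 0.5
print(harmonic_mean(0.90, 0.10))    # ~0.18

# Balanced precision and recall of 0.50 each
print(harmonic_mean(0.50, 0.50))    # 0.5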
F1 Score Formula Using Confusion Matrix
One can rewrite the same formula using confusion matrix terms:
F1 = 2TP / (2TP + FP + FN)
For example, if a model has a precision of 0.75 and a recall of 0.60, the F1 score is:
F1 = (2 × 0.75 × 0.60) / (0.75 + 0.60) = 0.90 / 1.35 ≈ 0.67
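As a quick check, the sketch below computes the same F1 score both from precision and recall and directly from confusion-matrix counts consistent with them (TP = 6, FP = 2, FN = 4), confirming that the two formulas agree:

TP, FP, FN = 6, 2, 4  # counts consistent with precision 0.75 and recall 0.60

precision = TP / (TP + FP)  # 0.75
recall = TP / (TP + FN)     # 0.60

f1_from_pr = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * TP / (2 * TP + FP + FN)

print(round(f1_from_pr, 2))      # 0.67
print(round(f1_from_counts, 2))  # 0.67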
For multi-class classification problems, the F1 score is calculated separately for each class and then averaged. A macro average treats all classes equally, while a weighted average accounts for class frequency. For highly imbalanced datasets, weighted F1 is generally the more representative overall metric. Always check which averaging method was used when comparing model performance.
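The sketch below shows the difference on a small, made-up three-class example using scikit-learn's average parameter (the label lists are hypothetical):

from sklearn.metrics import f1_score

# Hypothetical three-class labels; class 2 is rare but predicted perfectly
y_true_mc = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred_mc = [0, 0, 0, 0, 0, 1, 1, 1, 0, 2]

print(f1_score(y_true_mc, y_pred_mc, average=None))        # per-class F1: ~[0.83, 0.67, 1.00]
print(f1_score(y_true_mc, y_pred_mc, average="macro"))     # ~0.83, all classes weighted equally
print(f1_score(y_true_mc, y_pred_mc, average="weighted"))  # ~0.80, weighted by class frequency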
Using the F1 Score in Python with scikit-learn
The following binary classification example calculates precision, recall, and the F1 score with scikit-learn to show how these metrics work in practice.
First, import the necessary functions.
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
Now, define the true labels and the model's predictions for ten samples.
# True labels
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] # 1 = positive, 0 = negative
# Predicted labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
Next, compute the precision, recall, and F1 score for the positive class.
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
You can also generate a full classification report.
print ("nClassification Report:n", classification_report(y_true, y_pred))
Running this code produces output like the following:
Precision: 0.75
Recall: 0.6
F1 score: 0.6666666666666666

Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.80      0.73         5
           1       0.75      0.60      0.67         5

    accuracy                           0.70        10
   macro avg       0.71      0.70      0.70        10
weighted avg       0.71      0.70      0.70        10
Understanding Classification Report Output in scikit-learn
Let's interpret these results.
For the positive class (label 1), the precision is 0.75. This means that three out of every four samples the model labeled as positive were actually positive. The recall is 0.60, indicating that the model correctly identified 60% of all true positive samples. Combining these two values gives an F1 score of 0.67.
For the negative class (label 0), the recall is higher at 0.80, which shows that the model is better at detecting negative samples than positive ones. The overall accuracy is 70%, but accuracy alone does not reveal how the model performs on each class.
The classification report makes this easy to see. It displays precision, recall, and F1 per class, along with macro and weighted averages. In this balanced example the macro and weighted F1 scores are identical; on highly imbalanced datasets, the weighted F1 score can overemphasize the dominant class.
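To back up this interpretation, it also helps to print the confusion matrix for the same labels, continuing with the y_true and y_pred lists defined above:

from sklearn.metrics import confusion_matrix

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
print(confusion_matrix(y_true, y_pred))
# [[4 1]
#  [2 3]]

The second row shows the two positive samples the model missed, which is exactly what the recall of 0.60 reflects.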
This example illustrates how to compute and interpret the F1 score in practice. In real projects, compute the F1 score on validation or test data to check whether your model strikes the balance between false positives and false negatives that your application requires.
Best Practices and Common Pitfalls in Using F1 Score
Choose F1 based on your purpose:
- Use F1 when precision and recall are equally important.
- Avoid relying on F1 if one type of error is far more costly than the other.
- Use weighted F1 scores where needed.
Don't rely on F1 alone:
- F1 is a composite metric.
- It can hide how precision and recall are individually balanced.
- Always review precision and recall separately.
Handle class imbalances carefully:
- F1 is more informative than accuracy on imbalanced data.
- The averaging method affects the final score.
- Macro F1 treats all classes equally.
- Weighted F1 favors the more frequent classes.
- Choose a method that reflects your goals.
Watch out for zero or undefined scores:
- F1 can be zero if the class is never predicted.
- This may indicate a model or data problem.
- Always check the confusion matrix (see the sketch after these lists).
Use F1 wisely in model selection:
- F1 works well for comparing models.
- Small differences may not be meaningful.
- Combine F1 with domain knowledge and other metrics.
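As a small illustration of the zero-score pitfall mentioned above, the following sketch uses a made-up label set where the positive class is never predicted. The zero_division parameter (available in recent scikit-learn versions) tells the metric to report 0 instead of warning about an undefined precision:

from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical case: the model never predicts the positive class
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0]

print(f1_score(y_true_demo, y_pred_demo, pos_label=1, zero_division=0))  # 0.0
print(confusion_matrix(y_true_demo, y_pred_demo))  # second row shows the two missed positives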
Conclusion
The F1 score is a robust metric for evaluating classification models. It combines precision and recall into a single value and is particularly useful when both types of errors matter. It works best for problems with imbalanced data.
The F1 score highlights weaknesses that accuracy alone can hide. This article explained what the F1 score is, how it is calculated, and how to interpret it using Python examples.
The F1 score should be used with care, as with any evaluation metric. It works best when precision and recall are equally important. Always choose evaluation metrics based on your project goals. When used in the right context, the F1 score helps build balanced and reliable models.
Frequently Asked Questions
Q. What does an F1 score of 0.5 mean?
A. An F1 score of 0.5 indicates average performance. It means the model achieves only a moderate balance between precision and recall and is usually acceptable only as a baseline, especially for imbalanced datasets or early-stage models.
Q. What is a good F1 score?
A. A good F1 score depends on the problem. In general, a score above 0.7 is considered decent, above 0.8 strong, and above 0.9 excellent, especially in class-imbalanced tasks.
Q. Is a lower F1 score ever better?
A. No. Lower F1 scores indicate worse performance. Since F1 combines precision and recall, a higher value generally means the model makes fewer false positives and false negatives overall.
Q. When should the F1 score be used?
A. The F1 score is used when there is class imbalance or when both false positives and false negatives are significant. It provides a single metric that balances precision and recall, as opposed to accuracy, which can be misleading in these situations.
Q. Is 80% accuracy good?
A. 80% accuracy can be good or bad depending on the context. For balanced datasets it may be acceptable, but for imbalanced problems, high accuracy can mask poor performance on the minority classes.
Q. Should I use accuracy or the F1 score?
A. Use accuracy for balanced datasets where all errors are equally important. Use the F1 score when dealing with class imbalance or when precision and recall matter more than overall accuracy.