Machine Learning

Accuracy Is Dead: Calibration, Bias, and Other Metrics That Actually Matter

We Data Scientists love accuracy very much – but it can also mislead us.

It's easy to forget that models are meant to do more than just predict. We build models to make decisions, and that requires trust. And relying on accuracy alone is simply not enough.

In this post, we'll see why, and we'll examine alternative metrics that better match our needs. As usual, we'll follow a practical approach, with the goal of drawing deeper conclusions than standard metric testing allows.

Here is the table of contents for today's read:

  1. Setting up the models
  2. Classification: beyond accuracy
  3. Regression: advanced evaluation
  4. Conclusion

Setting up the models

Accuracy only makes sense for classification algorithms, not for regression tasks… so not all problems are measured the same way.

That's why I decided to handle both cases – regression and classification – separately, by building two different models.

And it will be very easy, because their performance and application are not what matters today:

  • Classification: will the striker score in the next game?
  • Regression: how many goals will a player score?

If you are a regular reader, I'm sure the use of soccer examples hasn't surprised you.

Disclaimer: even though we cannot use accuracy for our regression problem and this post is meant to focus on that metric, I didn't want to leave regression behind. That's why we'll also go through several regression metrics.

Also, because the data and feature work aren't the point today, let me skip all of that and jump straight to the models themselves:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

# Classification model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Gradient boosting regressor
model = GradientBoostingRegressor()
model.fit(X_train_scaled, y_train)

As you can see, we stick to simple models: a logistic regression for binary classification, and gradient boosting for regression.
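Because the data preparation is omitted, the snippets in this post don't run as-is. Here is a minimal synthetic stand-in – entirely invented for illustration, not the article's real match data – that makes the classification snippets reproducible:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Five made-up numeric features and a rare "player scored" label
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 2.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression().fit(X_train_scaled, y_train)
print(f"Positive rate: {y.mean():.2%}")
```

The regression snippets follow the same pattern, with a 0-4 goal count as the target.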

Let's examine the metrics we usually check:

# Classification
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {accuracy:.2%}")

The printed accuracy is 92.43%, which is way higher than I expected. Is the model really that good?

# Regression
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Test RMSE: {rmse:.4f}")

I got an RMSE of 0.3059. Not that good. But is it bad enough to get rid of our regression model?

We need to do better.

Classification: beyond accuracy

Many Data Science projects settle for accuracy, which often misleads, especially with imbalanced classes (e.g., scoring a goal is a rare event).

To check whether our model really answers "Will this player score?", here are some metrics to consider:

  • ROC-AUC: measures the ability to rank positives above negatives. It is threshold-independent, but says nothing about calibration.
  • PR-AUC: the area under the precision-recall curve, important for rare events (e.g., scoring chances). It focuses on the positive, minority class when positives are scarce.
  • Log loss: punishes overconfident wrong predictions. Good for comparing probabilistic outputs.
  • Brier score: measures the mean squared error between predicted probabilities and actual outcomes. Lower is better, and it is interpretable as overall calibration.
  • Calibration curve: a visual check of whether predicted probabilities match observed frequencies.
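As a quick illustration of two of the metrics above, here is a toy computation on made-up labels and probabilities (invented for this example, not our model's output):

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss

# Invented labels and predicted probabilities for an imbalanced problem
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_proba = np.array([0.05, 0.1, 0.1, 0.2, 0.15, 0.05, 0.3, 0.4, 0.7, 0.35])

# PR-AUC (average precision) focuses on the rare positive class
pr_auc = average_precision_score(y_true, y_proba)

# Brier score: mean squared gap between probability and outcome
brier = brier_score_loss(y_true, y_proba)

print(f"PR-AUC: {pr_auc:.3f}, Brier: {brier:.3f}")
```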

We won't go through all of them now, but let's take a brief look at ROC-AUC and log loss, perhaps the most used after accuracy.

ROC-AUC

ROC-AUC, or Receiver Operating Characteristic – Area Under the Curve, is a popular metric consisting of the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR).

Simply put, ROC-AUC (ranging from 0 to 1) summarizes how well the model can discriminate between classes across all classification thresholds.

A score of 0.5 reflects random guessing and 1 a perfect classifier.

Computing it in Python is easy:

from sklearn.metrics import roc_auc_score

# Predicted probabilities for the positive class
y_proba = model.predict_proba(X_test_scaled)[:, 1]

roc_auc = roc_auc_score(y_test, y_proba)

Here, y_test contains the true labels and y_proba contains our predicted probabilities. In my case the score is 0.7585 – quite low compared with the accuracy. But how is this possible, if accuracy is above 90%?

Spoiler: we are trying to predict whether a player will score or not. The "problem" is that the data is highly imbalanced: most players don't score in a given game, so our model learns that predicting "no" is usually right, without learning anything in particular.

It simply gets the majority class right, and accuracy is no longer meaningful.
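The imbalance trap is easy to reproduce with a dummy baseline (the 95/5 split below is illustrative, not our dataset's exact ratio): a model that always predicts "no goal" looks great on accuracy yet has zero ranking power.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 95 non-scorers, 5 scorers: always predicting "no" scores 95% accuracy
y_true = np.array([0] * 95 + [1] * 5)
always_no = np.zeros(100)
print(accuracy_score(y_true, always_no))  # 0.95

# ...but a constant probability cannot rank anyone: AUC is 0.5
print(roc_auc_score(y_true, np.full(100, 0.01)))  # 0.5
```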

Log loss

Logarithmic loss, cross-entropy or, simply, log loss, is used to assess performance on probabilistic outputs. It measures the difference between predicted probabilities and the actual (true) values, logarithmically.

Again, we can do this with a one-liner in Python:

from sklearn.metrics import log_loss

logloss = log_loss(y_test, y_proba)

As you may have guessed, the lower the value, the better, with 0 being a perfect model. In my case, I got 0.2345.

This metric is also affected by class imbalance: log loss rewards confidently predicting the majority class and, because our model almost always predicts 0, it gets punished hardest in exactly those cases where a goal was actually scored.
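To make the definition concrete, the formula -mean(y·log(p) + (1-y)·log(1-p)) can be checked against scikit-learn on a few invented values:

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented labels and probabilities, just to verify the formula
y_true = np.array([1, 0, 1, 0])
y_proba = np.array([0.9, 0.2, 0.6, 0.1])

manual = -np.mean(
    y_true * np.log(y_proba) + (1 - y_true) * np.log(1 - y_proba)
)
print(np.isclose(manual, log_loss(y_true, y_proba)))  # True
```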

Regression: advanced evaluation

Accuracy doesn't apply here, but there are plenty of interesting metrics to examine for the problem of predicting how many goals a player will score in a given game.

When you predict continuous outcomes, a single error number rarely tells the whole story.

Other metrics and checks:

  • R²: represents the proportion of variance in the target variable explained by the model.
  • RMSLE: penalizes under-prediction and helps when the target values are skewed (e.g., mostly-zero goal counts).
  • MAPE / sMAPE: percentage errors, but beware of the divide-by-zero issue.
  • Quantile loss: evaluates models that predict quantiles (e.g., the 10th, 50th, or 90th percentile outcome).
  • Residuals vs. predicted (plot): a visual check for heteroscedasticity.
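The divide-by-zero issue deserves a concrete sketch: plain MAPE is undefined whenever the true value is 0 (i.e., most of our games), while a guarded sMAPE stays finite. The helper below is a common formulation written for this post, not a scikit-learn function:

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric MAPE (0-200%); eps guards the case where both values are 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100 * np.mean(np.abs(y_pred - y_true) / np.maximum(denom, eps))

# A 0-goal game no longer blows up the metric
print(smape([0, 1, 2], [0, 1, 3]))
```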

Once again, let's dig deeper into a few of them.

R² score

It is also called the coefficient of determination, and it compares the model's error against a baseline that always predicts the mean. A score of 1 is perfect, 0 means the model does no better than predicting the mean for every case, and anything below 0 means it does worse than that naive baseline.

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)

I got a value of 0.0557, which is uncomfortably close to 0… not good.
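A quick sanity check of this definition, on invented numbers: R² equals 1 minus the model's squared error over the mean-baseline's squared error, and the mean predictor itself scores exactly 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and predictions
y_true = np.array([0.0, 1.0, 2.0, 4.0])
y_pred = np.array([0.5, 1.0, 1.5, 3.0])

sse = np.sum((y_true - y_pred) ** 2)         # model's squared error
sst = np.sum((y_true - y_true.mean()) ** 2)  # mean-baseline squared error
print(np.isclose(r2_score(y_true, y_pred), 1 - sse / sst))  # True

# Always predicting the mean gives R² = 0
print(r2_score(y_true, np.full(4, y_true.mean())))  # 0.0
```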

RMSLE

The Root Mean Squared Logarithmic Error, or RMSLE, measures the square root of the average squared difference between log-transformed predicted and actual values. This metric helps when:

  • We want to penalize under-prediction more than over-prediction.
  • Our target variable is skewed (it reduces the impact of large outliers).

import numpy as np
from sklearn.metrics import mean_squared_log_error

rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))

I got 0.19684, which loosely means my typical prediction error is around 0.2 goals on the log scale. That's not huge but, given that our target variable ranges between 0 and 4 and is heavily concentrated at 0…
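The asymmetry claim above is easy to verify on invented numbers: for the same absolute error, under-predicting is penalized more than over-predicting.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([10.0])

under = np.sqrt(mean_squared_log_error(y_true, [5.0]))   # 5 goals too low
over = np.sqrt(mean_squared_log_error(y_true, [15.0]))   # 5 goals too high

print(under > over)  # True: the log transform punishes under-prediction more
```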

Quantile loss

Also called pinball loss, it can be used with quantile regression models to check how good our quantile predictions are. If we build a quantile model (e.g., a GradientBoostingRegressor with the quantile loss), we can evaluate it like this:

from sklearn.metrics import mean_pinball_loss

# y_pred_quantile comes from a model trained with loss="quantile"
alpha = 0.9
q_loss = mean_pinball_loss(y_test, y_pred_quantile, alpha=alpha)

Here, with alpha=0.9 we aim at the 90th percentile. My quantile loss is 0.0644, which in relative terms is small (~1.6% of my target's range).

However, context matters: most of our y_test values are 0, so we should read this as "on average, the error of our model in the upper tail of the distribution is quite low."

That sounds impressive given such a 0-heavy target.

Still, because so many outcomes are 0, metrics like the ones we have seen and mentioned should be used together to check whether our model is actually doing well or not.

Conclusion

Building production-ready models goes beyond simply reaching "good accuracy."

For a classification task, you need to think about class imbalance, probability calibration, and real-world use cases such as pricing or risk management.

For a regression task, the goal is not only to minimize the error but to understand uncertainty – crucial if your predictions inform trading or business decisions.

In the end, the real value lies in:

  • Metrics carefully selected and validated for the problem at hand.
  • Advanced evaluation methods matched to the use case.
  • Clear, interpretable results.

If you've made it this far, you're no longer building "just another model." You're building robust, decision-ready tools. And the metrics we've examined here are just the entry point.
