Machine Learning

Beyond ROC-AUC and KS: Gini Coefficient, Simply Explained

In previous blogs, we discussed the ROC-AUC and Kolmogorov-Smirnov (KS) statistic metrics.

In this blog, we will look at another important metric of discrimination called the Gini coefficient.


Why do we have multiple metrics?

Each metric describes model performance from a different angle. We know that ROC-AUC summarizes the model's overall ranking power, and the KS statistic shows us where the maximum gap between the two groups occurs.

The Gini coefficient tells us how much better our model is than random guessing at ranking positives above negatives.


First, let's see how the Gini coefficient is calculated.

Here too, we use the German credit dataset.

Let's use the same sample data that we used to understand the calculation of the Kolmogorov-Smirnov (KS) statistic.

Image by the author

This sample data was obtained by running logistic regression on the German credit dataset.

From the model's predicted probabilities, we selected a sample of 10 records to demonstrate the Gini coefficient calculation.

Calculation

Step 1: Sort the data by predicted probability.

The sample data is already sorted by predicted probability.

Step 2: Compute the cumulative number of records and cumulative positives.

Cumulative count: the running number of records covered so far.

Cumulative population (%): the percentage of total records covered so far.

Cumulative positives: how many actual positives (class 2) we have seen up to this point.

Cumulative positives (%): the percentage of total positives captured so far.

Image by the author
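As a minimal sketch, the cumulative columns described above can be computed with pandas from the same 10-record sample (the column names `Actual` and `Pred_Prob_Class2` match the code used later in this post):

```python
import pandas as pd

# 10-record sample, already sorted by predicted probability (descending);
# class 2 is the positive class
df = pd.DataFrame({
    "Actual": [2, 2, 2, 1, 2, 1, 1, 1, 1, 1],
    "Pred_Prob_Class2": [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01],
})

df["Cum_Count"] = range(1, len(df) + 1)                        # cumulative records
df["Cum_Population_Pct"] = df["Cum_Count"] / len(df)           # cumulative population (%)
df["Cum_Positives"] = (df["Actual"] == 2).cumsum()             # cumulative positives
df["Cum_Positives_Pct"] = df["Cum_Positives"] / (df["Actual"] == 2).sum()

print(df[["Cum_Count", "Cum_Population_Pct", "Cum_Positives", "Cum_Positives_Pct"]])
```

The last two columns are exactly the X and Y values used in the next step.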

Step 3: Compute the X and Y values

X = cumulative population (%)

Y = cumulative positives (%)

Here, let us use Python to plot these X and Y values.

Code:

import matplotlib.pyplot as plt

X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0.0, 0.25, 0.50, 0.75, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]

# Plot curve
plt.figure(figsize=(6,6))
plt.plot(X, Y, marker='o', color="cornflowerblue", label="Model Lorenz Curve")
plt.plot([0,1], [0,1], linestyle="--", color="gray", label="Random Model (Diagonal)")
plt.title("Lorenz Curve from Sample Data", fontsize=14)
plt.xlabel("Cumulative Population % (X)", fontsize=12)
plt.ylabel("Cumulative Positives % (Y)", fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

Plot:

Image by the author

The curve we get when plotting cumulative population (%) against cumulative positives (%) is called the Lorenz curve.

Step 4: Calculate the area under the Lorenz curve.

When we discussed ROC-AUC, we found the area under the curve using the trapezoidal rule.

Each segment between two points was treated as a trapezoid, its area was calculated, and all the areas were summed to get the final total.

The same method is used here to calculate the area under the Lorenz curve.

Area under the Lorenz curve

Trapezoid area:

$$
\text{Area} = \frac{1}{2}(y_1 + y_2)(x_2 - x_1)
$$

From (0.0, 0.0) to (0.1, 0.25):
\[
A_1 = \frac{1}{2}(0 + 0.25)(0.1 - 0.0) = 0.0125
\]

From (0.1, 0.25) to (0.2, 0.50):
\[
A_2 = \frac{1}{2}(0.25 + 0.50)(0.2 - 0.1) = 0.0375
\]

From (0.2, 0.50) to (0.3, 0.75):
\[
A_3 = \frac{1}{2}(0.50 + 0.75)(0.3 - 0.2) = 0.0625
\]

From (0.3, 0.75) to (0.4, 0.75):
\[
A_4 = \frac{1}{2}(0.75 + 0.75)(0.4 - 0.3) = 0.075
\]

From (0.4, 0.75) to (0.5, 1.00):
\[
A_5 = \frac{1}{2}(0.75 + 1.00)(0.5 - 0.4) = 0.0875
\]

From (0.5, 1.00) to (0.6, 1.00):
\[
A_6 = \frac{1}{2}(1.00 + 1.00)(0.6 - 0.5) = 0.100
\]

From (0.6, 1.00) to (0.7, 1.00):
\[
A_7 = \frac{1}{2}(1.00 + 1.00)(0.7 - 0.6) = 0.100
\]

From (0.7, 1.00) to (0.8, 1.00):
\[
A_8 = \frac{1}{2}(1.00 + 1.00)(0.8 - 0.7) = 0.100
\]

From (0.8, 1.00) to (0.9, 1.00):
\[
A_9 = \frac{1}{2}(1.00 + 1.00)(0.9 - 0.8) = 0.100
\]

From (0.9, 1.00) to (1.0, 1.00):
\[
A_{10} = \frac{1}{2}(1.00 + 1.00)(1.0 - 0.9) = 0.100
\]

Total area under the Lorenz curve:
\[
A = 0.0125 + 0.0375 + 0.0625 + 0.075 + 0.0875 + 0.100 + 0.100 + 0.100 + 0.100 + 0.100 = 0.775
\]

This gives the area under the Lorenz curve: 0.775.
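The trapezoid sum above can be checked with a short Python sketch (a plain loop rather than a library function, to mirror the hand calculation):

```python
# X/Y points of the Lorenz curve from Step 3
X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0.0, 0.25, 0.50, 0.75, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]

# Sum trapezoid areas: 0.5 * (y1 + y2) * (x2 - x1) for each segment
area = sum(0.5 * (Y[i] + Y[i + 1]) * (X[i + 1] - X[i]) for i in range(len(X) - 1))
print(round(area, 4))  # 0.775
```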

Here, we plotted cumulative population (%) against cumulative positives (%), and the area under this curve shows how quickly the positives (class 2) are captured as we move down the ranked list.

In our sample data, we have 4 positives (class 2) and 6 negatives (class 1).

For a perfect model, by the time we reach 40% of the population, we would capture 100% of the positives.

The perfect model's curve looks like this.

Image by the author

Area under the perfect model's Lorenz curve:

\[
\begin{aligned}
\text{Perfect Area} &= \text{Triangle } (0,0)\text{ to }(0.4,1) \;+\; \text{Rectangle } (0.4,1)\text{ to }(1,1) \\[6pt]
&= \frac{1}{2} \times 0.4 \times 1 \;+\; 0.6 \times 1 \\[6pt]
&= 0.2 + 0.6 \\[6pt]
&= 0.8
\end{aligned}
\]

We also have another way to calculate the area under the perfect model's curve.

\[
\text{Let } \pi \text{ be the proportion of positives in the dataset.}
\]

\[
\text{Perfect Area} = \frac{1}{2}\pi \cdot 1 + (1-\pi)\cdot 1
\]
\[
= \frac{\pi}{2} + (1-\pi)
\]
\[
= 1 - \frac{\pi}{2}
\]

In our example:

Here, we have 4 positives in 10 records, so: π = 4/10 = 0.4.

\[
\text{Perfect Area} = 1 - \frac{0.4}{2} = 1 - 0.2 = 0.8
\]

This gives the area under the Lorenz curve for a perfect model with the same numbers of positives and negatives as our sample data.
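A quick sketch confirming that the closed form 1 - pi/2 matches the triangle-plus-rectangle breakdown:

```python
pi = 4 / 10                      # proportion of positives in the sample
triangle = 0.5 * pi * 1          # area of the triangle from (0, 0) to (pi, 1)
rectangle = (1 - pi) * 1         # area of the rectangle from (pi, 1) to (1, 1)
perfect_area = 1 - pi / 2        # closed form

print(round(triangle + rectangle, 4), round(perfect_area, 4))  # 0.8 0.8
```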

Now, if we went through the dataset in random order instead of ranked order, the positives would arrive uniformly. This means we would collect positives at the same rate at which we move through the population.

This is the random model, and it always gives an area under the curve of 0.5.
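This can be illustrated with a small simulation (an illustration I am adding here, not part of the original calculation): shuffle the 10 labels many times, recompute the Lorenz area each time, and the average comes out close to 0.5.

```python
import random

random.seed(0)

labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # 4 positives, 6 negatives
n, n_pos = len(labels), sum(labels)

def lorenz_area(ordered):
    """Area under the cumulative-positives curve, via trapezoids."""
    xs = [i / n for i in range(n + 1)]
    ys = [0.0]
    cum = 0
    for lab in ordered:
        cum += lab
        ys.append(cum / n_pos)
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i]) for i in range(n))

areas = []
for _ in range(5000):
    random.shuffle(labels)
    areas.append(lorenz_area(labels))

mean_area = sum(areas) / len(areas)
print(round(mean_area, 2))  # close to 0.5
```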

Image by the author

Step 5: Calculate the Gini coefficient

\[
A_{\text{model}} = 0.775
\]

\[
A_{\text{random}} = 0.5
\]
\[
A_{\text{perfect}} = 0.8
\]
\[
\text{Gini} = \frac{A_{\text{model}} - A_{\text{random}}}{A_{\text{perfect}} - A_{\text{random}}}
\]
\[
= \frac{0.775 - 0.5}{0.8 - 0.5}
\]
\[
= \frac{0.275}{0.3}
\]
\[
\approx 0.92
\]

We found Gini = 0.92, meaning almost all the positives are concentrated at the top of the ranked list. This shows the model does an excellent job of separating positives from negatives, approaching a perfect model.
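Putting the three areas together in code:

```python
a_model, a_random, a_perfect = 0.775, 0.5, 0.8

# Gini = (model area - random area) / (perfect area - random area)
gini = (a_model - a_random) / (a_perfect - a_random)
print(round(gini, 4))  # 0.9167
```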


Now that we have seen how the Gini coefficient is calculated, let's recap what we did.

We took a 10-record sample containing predicted probabilities from logistic regression.

We sorted the records by predicted probability.

Next, we calculated the cumulative population (%) and cumulative positives (%), and plotted them.

This gave us a curve called the Lorenz curve, and we calculated the area under it, which is 0.775.

Now, what does 0.775 mean?

Our sample contains 4 positives (class 2) and 6 negatives (class 1).

The predicted probabilities are for class 2, i.e., the probability that a customer belongs to class 2.

In our sample data, all the positives are captured within the top 50% of the population, meaning the positives are ranked high.

If the model were perfect, the positives would be captured within the first 4 rows, i.e., within 40% of the population, and the area under the perfect model's curve would be 0.8.

But our model's area is 0.775, which is almost perfect.

Here, the area under the Lorenz curve measures ranking quality: the more the positives are concentrated at the top of the list, the better the model separates positives from negatives.

Next, we calculated the Gini coefficient, which is 0.92.

\[
\text{Gini} = \frac{A_{\text{model}} - A_{\text{random}}}{A_{\text{perfect}} - A_{\text{random}}}
\]

The numerator tells us how much better our model is than random guessing.

The denominator tells us the best possible improvement over random.

The ratio puts these two together, so the Gini coefficient always falls between 0 (random) and 1 (perfect).

Gini measures how close the model comes to perfectly separating the positive and negative classes.

But a doubt may arise: why calculate Gini at all, and why not stop at 0.775?

0.775 is just the area under our model's Lorenz curve. By itself, it doesn't tell us how close we are to perfection without comparing it to 0.8, the perfect model's area.

Therefore, we calculate Gini to rescale this number to fall between 0 and 1, making it easier to compare models.


Banks use the Gini coefficient to evaluate credit risk models alongside ROC-AUC and KS statistics. Together, these metrics provide a complete picture of model performance.


Now, let's calculate the ROC-AUC for our sample data.

import pandas as pd
from sklearn.metrics import roc_auc_score

# Sample data
data = {
    "Actual": [2, 2, 2, 1, 2, 1, 1, 1, 1, 1],
    "Pred_Prob_Class2": [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]
}

df = pd.DataFrame(data)

# Convert Actual: class 2 -> 1 (positive), class 1 -> 0 (negative)
y_true = (df["Actual"] == 2).astype(int)
y_score = df["Pred_Prob_Class2"]

# Calculate ROC-AUC
roc_auc = roc_auc_score(y_true, y_score)
roc_auc

We get AUC = 0.9583.

Now, Gini = (2 * AUC) - 1 = (2 * 0.9583) - 1 ≈ 0.92

This is the relationship between Gini and ROC-AUC.
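The same identity can be checked without sklearn, using a plain pairwise AUC (the fraction of positive-negative pairs that the model ranks correctly):

```python
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]    # class 2 -> 1 (positive), class 1 -> 0
scores = [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# AUC = fraction of (positive, negative) pairs where the positive scores higher
# (ties count as half)
pairs = [(p, q) for p in pos for q in neg]
auc = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs) / len(pairs)

gini = 2 * auc - 1
print(round(auc, 4), round(gini, 4))  # 0.9583 0.9167
```

This matches the Gini of 0.92 we got from the Lorenz-curve areas, since normalizing by the perfect model's area is equivalent to the 2·AUC - 1 rescaling.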


Now let's calculate the Gini coefficient on the full dataset.

Code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
file_path = "C:/german.data"
data = pd.read_csv(file_path, sep=" ", header=None)

# Rename columns
columns = [f"col_{i}" for i in range(1, 21)] + ["target"]
data.columns = columns

# Features and target
X = pd.get_dummies(data.drop(columns=["target"]), drop_first=True)
y = data["target"]

# Convert target to binary: class 2 -> 1 (positive), class 1 -> 0 (negative)
y = (y == 2).astype(int)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predicted probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC-AUC
auc = roc_auc_score(y_test, y_pred_proba)

# Calculate Gini
gini = 2 * auc - 1

auc, gini

We get Gini ≈ 0.60 on the test set.

Interpretation:

Gini > 0.5: acceptable.

Gini = 0.6 to 0.7: a good model.

Gini = 0.8+: very good, and rarely seen in practice.


Dataset credit

The data used in this blog is the German Credit dataset, which is publicly available from the UCI Machine Learning Repository. It is provided under the Creative Commons Attribution 4.0 License (CC BY 4.0), which means it can be freely used and shared with proper attribution.


I hope you found this blog helpful.

If you enjoyed reading, consider sharing it with your network, and feel free to share your thoughts.

If you haven't read my earlier blogs on ROC-AUC and the Kolmogorov-Smirnov statistic, you can check them out here.

Thanks for reading!
