
How Metrics (and LLMs) Can Deceive You: A Field Guide

Overview

Paradoxes are more than just visual tricks or puzzles of the mind. They can be subtle, causing a first impression to break down under closer investigation. In data science, paradoxes arise when we take numbers at face value without looking at the context behind them. A chart can look perfectly sharp and still tell the wrong story.

In this article, we discuss three common paradoxes that serve as cautionary tales for anyone who interprets data quickly and without context. We examine how these paradoxes show up in data science and business intelligence (BI) reports and in retrieval-augmented generation (RAG) systems, where they degrade both the quality of the insights delivered and the results of the model.

Simpson's Paradox in Business Intelligence

Simpson's paradox describes a situation where a trend reverses when data is combined. In other words, a tendency visible in each subcategory disappears, or flips, when you aggregate the numbers and analyze them together. Imagine that we evaluate sales at four locations of a famous ice cream chain. When each site's sales are analyzed separately, chocolate appears to be the clear favorite among customers. But when the sales are added together, the pattern reverses, and the combined results suggest that vanilla is more popular. This reversal is Simpson's paradox. We use fictional data below to show it.

Location   Chocolate   Vanilla   Total Customers   Chocolate %   Vanilla %   Winner
Suburb A   15          5         20                75.0%         25.0%       Chocolate
City B     33          27        60                55.0%         45.0%       Chocolate
Mall       2080        1920      4000              52.0%         48.0%       Chocolate
Airport    1440        2160      3600              40.0%         60.0%       Vanilla
Overall    3568        4112      7680              46.5%         53.5%       Vanilla!
Ice cream sales by chain location (by the author)
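The reversal can be reproduced in a few lines. The sketch below recomputes the table's percentages from the per-location counts (the location names are from the fictional example above) and shows how the winner flips once the locations are pooled.

```python
# Fictional sales counts from the table: location -> (chocolate, vanilla).
sales = {
    "Suburb A": (15, 5),
    "City B": (33, 27),
    "Mall": (2080, 1920),
    "Airport": (1440, 2160),
}

# Per-location view: chocolate wins at 3 of the 4 sites.
for loc, (choc, van) in sales.items():
    total = choc + van
    winner = "Chocolate" if choc > van else "Vanilla"
    print(f"{loc:10s} {choc / total:5.1%} chocolate -> {winner}")

# Aggregated view: the trend flips and vanilla wins overall.
choc_all = sum(c for c, _ in sales.values())
van_all = sum(v for _, v in sales.values())
print(f"Overall    {choc_all / (choc_all + van_all):5.1%} chocolate")
```

Running this prints chocolate as the winner for every location except the airport, yet the overall chocolate share is only 46.5%, so vanilla wins the aggregate.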

Below is a visual representation.

Simpson's paradox in a BI report (Image by the author)

A data analyst who sees only the aggregated numbers may conclude that chocolate is the less popular flavor. It is therefore important to compare aggregated numbers against the subgroups and check for Simpson's paradox. When the reversal is real, identifying the confounding variable should be the next step. A confounding variable is a hidden factor that influences the group outcome; in this case, the store location acts as the confounder. A deeper interpretation explains why vanilla ice cream sales were so high at the airport, which drives the overall result. Some questions that can be used to investigate:

• Are fewer chocolate options available at the airport stores?

• Do travelers prefer a different flavor?

• Was there a promotional campaign favoring vanilla in the airport stores?

Simpson's Paradox in RAG Systems

Suppose you have a RAG (retrieval-augmented generation) model that summarizes public sentiment about electric vehicles (EVs) and answers related questions. The model uses news articles from 2010 to 2024. Until 2016, EVs received largely negative coverage due to their limited range, higher purchase price, and the lack of charging infrastructure. All of these factors made driving an EV over long distances impractical, and news reports before 2017 tended to highlight such shortcomings. From 2017 onward, however, EVs began to be viewed favorably as the technology improved and charging stations proliferated, a shift that followed Tesla's successful mass-market launches. A RAG model drawing on news reports from 2010 to 2024 will therefore likely give conflicting answers to the same question, which is Simpson's paradox in disguise.

For example, if the RAG system is asked, "Is EV adoption in the US growing or declining?" it may answer differently than when asked, "Has EV adoption in the US improved recently?" In this case, the confounding variable is the publication date. A practical adjustment is to tag documents (articles) into time bins based on when they were published. Other options include prompting users to specify a time range in their queries.
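The time-binning adjustment can be sketched as metadata filtering before retrieval. This is a minimal illustration, not a real RAG pipeline: the two-document corpus, the 2017 cutoff, and the `retrieve` helper are all hypothetical, and a production system would apply the same filter as a metadata query in its vector store.

```python
from datetime import date

# Hypothetical mini-corpus: each document carries its publication date.
docs = [
    {"text": "EVs suffer from range anxiety and sparse charging networks.",
     "published": date(2014, 6, 1)},
    {"text": "Charging infrastructure expands and EV adoption accelerates.",
     "published": date(2021, 3, 15)},
]

def time_bin(published):
    """Tag a document with a coarse era bin (assuming a 2017 turning point)."""
    return "pre-2017" if published.year < 2017 else "2017-onward"

for doc in docs:
    doc["bin"] = time_bin(doc["published"])

def retrieve(query, corpus, time_range=None):
    """Toy retriever: restrict the candidate pool to the requested bin.
    Similarity ranking over `pool` would happen here in a real system."""
    pool = [d for d in corpus if time_range is None or d["bin"] == time_range]
    return pool

# A question about the current state should only see recent coverage.
recent = retrieve("Is EV adoption growing in the US?", docs,
                  time_range="2017-onward")
```

With the filter applied, only the post-2017 article reaches the generator, so the answer reflects the era the user is asking about instead of averaging over contradictory coverage.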

Simpson's paradox in RAG systems (Image by the author)

The Accuracy Paradox in Data Science Problems

The crux of the accuracy paradox is that high accuracy does not necessarily mean a useful model. Imagine building a classification model to detect whether a patient has a rare disease that affects only 1 in 100 people. A model that simply labels every patient as healthy achieves 99% accuracy, yet it fails to identify the one person who actually has the disease and needs medical care. In that case, the model is useless for detecting the disease, which is its entire purpose. This occurs especially in imbalanced datasets where the minority class is small. It is illustrated in the figure below.

The accuracy paradox in data science (Image by the author)

The best way to deal with the accuracy paradox is to use metrics that capture performance on the minority class, such as precision, recall, and F1-score. Another approach is to treat the imbalanced data as an anomaly-detection problem rather than a classification problem. One can also collect more minority-class data (if possible), over-sample the minority class, or under-sample the majority class. Below is a quick guide that helps decide which metric to use depending on the application, its purpose, and the cost of errors.

Choosing the right metric to measure your model's performance (Image by the author)

The Accuracy Paradox in LLMs

While the accuracy paradox is a common problem many data scientists face, its effects on LLMs are often overlooked. The metric can be dangerously misleading in use cases involving safety, toxicity detection, and content moderation. High accuracy does not mean the model is correct and safe to use. For example, an LLM with 98% accuracy is not fit for deployment if it misclassifies the 2% of harmful prompts as safe and harmless. Therefore, in LLM evaluation it is a good idea to report recall, precision, or PR-AUC alongside accuracy, as these show how well the model handles the minority classes.
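Here is the 98%-accuracy trap as a tiny, made-up safety evaluation: 100 prompts of which 5 are harmful, and a hypothetical filter that flags only 3 of them. Overall accuracy looks excellent while recall on the harmful class tells the real story.

```python
# Toy safety evaluation: 100 prompts, 5 of them harmful (all data invented).
labels = ["harmful"] * 5 + ["safe"] * 95
preds  = ["harmful"] * 3 + ["safe"] * 2 + ["safe"] * 95  # filter catches only 3 of 5

# Overall accuracy: 98 of 100 prompts classified correctly.
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)   # 0.98

# Recall on the harmful class: the metric that actually matters here.
tp = sum(y == "harmful" and p == "harmful" for y, p in zip(labels, preds))
fn = sum(y == "harmful" and p == "safe" for y, p in zip(labels, preds))
harmful_recall = tp / (tp + fn)   # 0.6 -- two harmful prompts slipped through as "safe"
```

A safety report that only quotes the 98% would hide the fact that 40% of harmful prompts pass the filter.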

Goodhart's Law

The economist Charles Goodhart observed that "when a measure becomes a target, it ceases to be a good measure." The law is a gentle reminder that if you optimize a metric without understanding its context, the metric will turn against you.

Consider the manager of an online news agency who sets a KPI for her team: increase average session time by 20%. The team lengthens articles and adds filler content to drive up session time. Session time goes up, but content quality drops, and as a result the number of returning users declines.

Another example involves customer churn. In an attempt to reduce churn, an entertainment subscription app buries the 'Unsubscribe' button deep within its web portal. As a result, customer churn decreases, but not because of improved customer satisfaction: the drop comes only from restricted exit options, not genuine customer retention. Below is a visual that shows how gaming a metric (such as session time or user engagement) can produce unintended results and degrade the user experience. When teams turn to inflation tactics to drive a metric, the metric looks good on paper, but the improvement is meaningless.

Goodhart's law, illustrated (Image by the author)

Goodhart's Law in LLMs

When an LLM is trained to optimize heavily for specific data (especially benchmarks), it can memorize patterns from that training data instead of learning to generalize. This is a textbook example of overfitting: the model performs very well on the training data but fails on real-world inputs.

Imagine that you train an LLM to summarize news articles and use ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to evaluate its performance. The ROUGE metric rewards exact or near-exact n-gram overlap with reference summaries. Over time, the LLM learns to copy large phrases verbatim from the input articles to boost its ROUGE scores, and to insert the buzzwords that appear most often in the reference summaries. Suppose an input article says, "The central bank raised interest rates to curb inflation, and this caused a significant decline in the stock market." An overfit model might summarize it as "Bank raised interest rates to curb inflation," while a well-generalized model would summarize it as "Rising interest rates caused stock markets to fall." This shows how over-optimizing for an evaluation metric can lead to low-quality answers: good on paper, but unhelpful to the reader.
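The gaming effect is easy to demonstrate with a crude ROUGE-1-style score. The function below is a simplification (real ROUGE implementations add count clipping, stemming, and F-measures), and the reference and candidate summaries are invented for this sketch.

```python
def rouge1_recall(candidate, reference):
    """Crude ROUGE-1 recall: share of reference unigrams present in the candidate.
    Real ROUGE adds count clipping, stemming, and precision/F-measure variants."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    return sum(w in cand for w in ref) / len(ref)

reference = "bank raised rates to curb inflation causing stock market decline"
# Near-verbatim copy of the source article vs. a short abstractive summary.
copied = "the bank raised interest rates to curb inflation and this caused a significant decline"
abstract = "rising interest rates sent stock markets lower"

score_copied = rouge1_recall(copied, reference)      # 0.7
score_abstract = rouge1_recall(abstract, reference)  # 0.2
```

The verbatim copy wins comfortably on the metric even though the abstractive summary may serve a reader just as well, which is exactly how a model trained to chase ROUGE drifts toward copying.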

Goodhart's law in LLMs (Image by the author)

Closing Thoughts

Whether in business intelligence or LLMs, paradoxes can creep in whenever numbers and metrics are trusted without scrutiny of the underlying context. It is also important to remember that what is easily measurable can obscure the bigger picture. Pairing metrics with human understanding is essential to avoid such pitfalls and to build reliable reports and robust LLMs that deliver real value.
