
How to Evaluate Retrieval Quality in RAG Pipelines (Part 3): DCG@k and NDCG@k

Also make sure to check out the previous sections:

👉 Part 1: Precision@k, Recall@k, and F1@k

👉 Part 2: Mean Reciprocal Rank (MRR) and Average Precision (AP)

In my series introducing retrieval evaluation methods for RAG pipelines, we have so far taken a detailed look at binary retrieval evaluation metrics. More specifically, in Part 1 we went over binary, order-unaware metrics for evaluating retrieval, such as HitRate@k, Precision@k, Recall@k, and F1@k. Binary, order-unaware metrics are essentially the most basic type of measures we can use to get a feel for the performance of our retrieval step; they simply classify each result as relevant or not, and check whether the relevant results show up in the retrieved set.

Then, in Part 2, we looked at binary, order-aware evaluation metrics such as Mean Reciprocal Rank (MRR) and Average Precision (AP). Binary, order-aware metrics also classify the results as either relevant or irrelevant and check whether they appear in the retrieved set, but they go a step further and assess how well the results are ranked. In other words, they also pay attention to the position at which each result is retrieved, not just whether it is retrieved at all.

In this final part of the retrieval metrics series, I will expand on another major category of metrics beyond binary ones: graded relevance metrics. Unlike binary metrics, where a result is either relevant or irrelevant, with graded metrics relevance is a spectrum. In this way, a retrieved chunk can be more or less relevant to the user's query.

Two of the most commonly used graded metrics, which we will be looking at in today's post, are Discounted Cumulative Gain (DCG@k) and Normalized Discounted Cumulative Gain (NDCG@k).


I write 🍨 Data details, where I study and experiment with AI and data. Subscribe here to learn and experiment along with me.


Graded relevance metrics

When it comes to graded retrieval metrics, it is first important to understand the concept of graded relevance. That is, with graded measures, a retrieved item can be more or less relevant to the query, as expressed by its relevance score rel_i.

Image by the author

🎯 Discounted Cumulative Gain (DCG@k)

Discounted Cumulative Gain (DCG@k) is an order-aware, graded evaluation metric that allows us to measure how useful the retrieved results are, taking into account the positions at which they are retrieved. We can calculate it as follows:

Image by the author
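In standard notation, consistent with the description below and with the Python implementation later in this post, DCG@k is defined as:

\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}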

Here, the numerator rel_i is the graded relevance of the i-th retrieved result, essentially expressing how relevant the retrieved chunk of text is. The denominator is a logarithm of the result's position. In practice, this allows us to penalize items that appear lower in the retrieved set, reinforcing the idea that results at the top matter more. Therefore, the more relevant a result is, the higher the score, but the lower the position at which it appears, the more its contribution is discounted.

Let's take a closer look at this with a simple example:

Image by the author

In any case, the main issue with DCG@k is that, as you can see, it is essentially a running sum of the relevance of the retrieved results. Therefore, retrieved sets with more results (larger k) and/or more relevant results will lead to a larger DCG@k. For instance, in our example, if we consider just k = 4 we already get some value for DCG@4; DCG@6 will be even higher, and so on. As k increases, DCG@k generally increases, since we include more results, unless the additional items have zero relevance. However, a higher DCG@k does not necessarily mean that retrieval performance is better. On the contrary, this is a problem, because it does not allow us to directly compare retrieved sets with different values of k using DCG@k.
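To make this concrete, here is a quick sketch of how DCG@k grows with k, using the relevance scores [3, 2, 3, 0, 1] from the Python example later in this post (these numbers are mine, not necessarily those of the figure above):

\mathrm{DCG@1} = \frac{3}{\log_2 2} = 3.00

\mathrm{DCG@3} = 3.00 + \frac{2}{\log_2 3} + \frac{3}{\log_2 4} \approx 5.76

\mathrm{DCG@5} = 5.76 + \frac{0}{\log_2 5} + \frac{1}{\log_2 6} \approx 6.15

Each additional position can only add a non-negative term, so DCG@k never decreases as k grows.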

This issue is successfully resolved by the next graded metric we will discuss today: NDCG@k. But before that, we need to introduce IDCG@k, which is required in order to calculate NDCG@k.

🎯 Ideal Discounted Cumulative Gain (IDCG@k)

Ideal Discounted Cumulative Gain (IDCG@k), as its name indicates, is the DCG@k we would get in the ideal situation where our retrieved set is perfectly ordered by relevance. Let's see what the IDCG of our example would be:

Image by the author

Obviously, for a given k, IDCG@k will always be greater than or equal to any DCG@k, since it represents the maximum possible cumulative gain for the top k results.
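Formally, IDCG@k is simply DCG@k computed over the relevance scores sorted in descending order, which is exactly what the idcg_at_k function later in this post does:

\mathrm{IDCG@k} = \sum_{i=1}^{k} \frac{rel_i^{\mathrm{ideal}}}{\log_2(i + 1)}

where rel^ideal denotes the relevance scores rearranged from highest to lowest.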

Finally, we can now calculate Normalized Discounted Cumulative Gain (NDCG@k), using DCG@k and IDCG@k.

🎯 Normalized Discounted Cumulative Gain (NDCG@k)

Normalized Discounted Cumulative Gain (NDCG@k) is essentially a normalized version of DCG@k, solving the problem described above and allowing fair comparisons across retrieved sets of different sizes k. We can calculate NDCG@k with the following formula:

Image by the author
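In standard notation, and matching the ndcg_at_k function implemented below, this is simply the ratio:

\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}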

Essentially, NDCG@k allows us to express how close our current retrieval is to the ideal one, within the top k results. This also gives us the convenience of a score that is comparable across different values of k. In our example, NDCG@5 would be:

Image by the author

In general, NDCG@k can range from 0 to 1, with 1 representing a perfect retrieval and ordering of the results, and 0 indicating a completely irrelevant retrieved set.
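As a quick sanity check of these bounds (a sketch, following the convention the code below uses for the degenerate 0/0 case):

\text{perfect ordering: } \mathrm{DCG@k} = \mathrm{IDCG@k} \;\Rightarrow\; \mathrm{NDCG@k} = 1

\text{nothing relevant retrieved: } \mathrm{DCG@k} = 0 \;\Rightarrow\; \mathrm{NDCG@k} = 0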

So, how do we actually calculate DCG and NDCG in Python?

If you've read my other RAG tutorials, you know that we typically use the text of War and Peace as our example input. In any case, the full code example is getting too long to include in every post, so here I'll only show how to calculate DCG@k, IDCG@k, and NDCG@k, doing my best to keep this post at a reasonable length.

To calculate these retrieval metrics, we first need to define the ground truth set, just like we did in Part 1 when we calculated Precision@k and Recall@k. The difference is that here, instead of marking each retrieved chunk as relevant or not using binary relevance (0 or 1), we now assign it a graded relevance score; for example, from completely irrelevant (0) to highly relevant (5). Therefore, our ground truth setup will include, for each question, the documents along with their graded relevance scores.

For example, for a question like “Who is Anna Pávlovna?”, a retrieved chunk that directly answers the question might receive a relevance score of 3, one that merely touches on the required information might receive a 2, and a completely unrelated chunk would receive a relevance score equal to 0.
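As a minimal sketch of what this could look like in code, assuming hypothetical chunk IDs and scores (these names are illustrative and not from the original example), the graded ground truth and the resulting relevance list might be built like this:

# Hypothetical graded ground truth for the query "Who is Anna Pávlovna?"
ground_truth = {
    "chunk_017": 3,  # directly answers the question
    "chunk_042": 2,  # touches on the required information
    "chunk_101": 0,  # completely unrelated passage
}

# Retrieved chunk IDs, in the order returned by the retriever;
# anything missing from the ground truth is treated as irrelevant (0)
retrieved_ids = ["chunk_017", "chunk_101", "chunk_042"]
relevance = [ground_truth.get(chunk_id, 0) for chunk_id in retrieved_ids]
print(relevance)  # [3, 0, 2]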

Using this list of graded relevance scores instead of binary relevance labels, we can calculate DCG@k, IDCG@k, and NDCG@k. We will use Python's math library to handle the logarithmic terms:

import math

First, we can define a function for calculating DCG@k as follows:

# DCG@k
def dcg_at_k(relevance, k):
    # Consider at most the top-k retrieved results
    k = min(k, len(relevance))
    # Position i starts at 1, so each result is discounted by log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

We can also calculate IDCG@k using the same concept. In fact, IDCG@k is simply the DCG@k of the perfectly ordered retrieval; therefore, we can easily obtain it by computing DCG@k after sorting the relevance scores in descending order.

# IDCG@k
def idcg_at_k(relevance, k):
    # The ideal ordering places the most relevant results first
    ideal_relevance = sorted(relevance, reverse=True)
    return dcg_at_k(ideal_relevance, k)

Finally, having defined DCG@k and IDCG@k, we can easily calculate NDCG@k as their ratio. Specifically:

# NDCG@k
def ndcg_at_k(relevance, k):
    dcg = dcg_at_k(relevance, k)
    idcg = idcg_at_k(relevance, k)
    # Guard against division by zero when nothing retrieved is relevant
    return dcg / idcg if idcg > 0 else 0.0

As explained, each of these functions takes as input a list of graded relevance scores for the retrieved chunks. For example, let's say that for a certain query, after scoring the retrieved results against the ground truth, we end up with the following list:

relevance = [3, 2, 3, 0, 1]

We can then calculate the respective retrieval metrics using our functions:

print(f"DCG@5: {dcg_at_k(relevance, 5):.4f}")
print(f"IDCG@5: {idcg_at_k(relevance, 5):.4f}")
print(f"NDCG@5: {ndcg_at_k(relevance, 5):.4f}")

And that was it! This is how we can calculate graded retrieval metrics for our RAG pipeline in Python.

Finally, as with all the other retrieval metrics, we can average these scores over a set of different evaluation queries to obtain a more representative measure of retrieval performance.
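As a minimal sketch, assuming a handful of made-up per-query relevance lists, averaging NDCG@5 over an evaluation set could look like this:

# Hypothetical graded relevance lists for a small evaluation set of queries
queries_relevance = {
    "query_1": [3, 2, 3, 0, 1],
    "query_2": [2, 0, 1, 0, 0],
    "query_3": [0, 3, 2, 1, 0],
}

# Average NDCG@5 across all evaluation queries
mean_ndcg = sum(ndcg_at_k(rel, 5) for rel in queries_relevance.values()) / len(queries_relevance)
print(f"Mean NDCG@5: {mean_ndcg:.4f}")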

On my mind

Today's post on graded relevance metrics concludes my introductory series on the metrics most commonly used for evaluating the retrieval performance of RAG pipelines. In particular, throughout this series of posts, we examined binary order-unaware metrics, binary order-aware metrics, and graded relevance metrics, in order to get a well-rounded picture of how retrieval evaluation works. Obviously, there are several other things we can look at when evaluating a RAG pipeline's retrieval step, for example, the latency per query or the number of tokens sent. However, the metrics I've covered in these posts capture the fundamentals of retrieval performance evaluation.

This allows us to measure, evaluate, and ultimately improve the performance of the retrieval step, ultimately paving the way towards building effective RAG pipelines that produce grounded responses, faithful to the retrieved documents.



