How to Evaluate Retrieval Quality in RAG Pipelines (Part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP)

If you missed Part 1: How to Evaluate Retrieval Quality in RAG Pipelines, check it out here
In my previous post, I looked at how to evaluate the retrieval quality of a RAG pipeline, along with some basic metrics for doing so. Specifically, that first part focused on binary, order-agnostic metrics, which only test whether each relevant result is present in the retrieved set or not. In this second part, we will move on to binary, order-aware metrics. That is, metrics that also take into account the position at which each relevant result is retrieved, rather than just whether it is retrieved. Thus, in this post, we will take a closer look at two binary, order-aware metrics: Mean Reciprocal Rank (MRR) and Average Precision (AP).
Why retrieval evaluation is important
Effective retrieval is crucial in a RAG pipeline, given that a good retrieval mechanism is the first step towards generating valid responses grounded in our documents. After all, if the relevant documents containing the required information cannot be retrieved in the first place, no generative model downstream can fix this and provide valid answers.
We can distinguish between two major categories of retrieval evaluation metrics: binary metrics and graded metrics. Specifically, binary metrics classify each retrieved chunk as either relevant or irrelevant, with no in-between cases. On the flip side, graded metrics measure how relevant each retrieved chunk is to the user's query, so a retrieved chunk can be more or less relevant.
Binary metrics can be further divided into order-agnostic and order-aware metrics. Order-agnostic metrics only check whether a chunk is present in the retrieved set or not, regardless of the position at which it was retrieved. In my last post, we took a detailed look at some of the most common binary, order-agnostic metrics and ran a detailed code example in Python. That is, we covered Hit Rate@K, Precision@K, Recall@K, and F1@K. In contrast, binary, order-aware metrics not only consider whether chunks are present in the retrieved set, but also take into account the positions at which they are retrieved.
Thus, in today's post, we will take a detailed look at the most commonly used binary, order-aware retrieval metrics, namely MRR and AP, and also check how they can be calculated in Python.
I write 🍨 Data Details, where I study and experiment with AI and data. Subscribe here to learn and experiment with me.
Order-aware, binary metrics
So, binary, order-agnostic metrics like Precision@K or Recall@K tell you whether the relevant documents are present in the top K results or not, but they don't show whether a relevant document lands at the top or the bottom of those K chunks. And this information is exactly what order-aware metrics give us. Two of the most useful and widely used order-aware metrics are Mean Reciprocal Rank (MRR) and Average Precision (AP). But let's look at all this in more detail.
🎯 Mean Reciprocal Rank (MRR)
A commonly used order-aware metric for evaluating retrieval is Mean Reciprocal Rank (MRR). Taking one step back, the Reciprocal Rank (RR) indicates the position at which the first truly relevant result appears among the top K retrieved results. Essentially, it measures how high up the first relevant result is placed. RR can be calculated as follows, with rank_i being the position at which the first relevant result appears:
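RR = 1 / rank_i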
And we can check this calculation with the following example:
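For instance, suppose the top five retrieved chunks have relevance labels [0, 0, 1, 0, 1] (a made-up set, just for illustration). The first relevant chunk appears at position 3, so RR = 1/3 ≈ 0.33. If the very first retrieved chunk had been relevant, RR would have been 1/1 = 1, the best possible value.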

Now we can put it all together into the Mean Reciprocal Rank (MRR). MRR is simply the average of the Reciprocal Ranks over all evaluated queries; it expresses, on average, how high the first relevant chunk appears in each retrieved set.
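MRR = (1 / |Q|) · Σᵢ (1 / rank_i), where |Q| is the total number of evaluated queries and rank_i is the position of the first relevant result for the i-th query.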

In this way, MRR can range from 0 to 1. The higher the MRR, the higher up the first relevant document tends to appear in the retrieved results.
A real-life example where a metric such as MRR is useful for evaluating the retrieval step of a RAG pipeline is any search-like application where we need to be sure that the right result appears at the very top of the results. It works well for evaluation setups where a single relevant result is enough, and the important information is not scattered across multiple chunks of text.
A good metaphor for understanding MRR as a retrieval metric is Google search. We think of Google as a great search engine because we can usually find what we need within the top results. If you had to scroll down to result 150 to actually find what you were looking for, you wouldn't think of it as a good search engine. Similarly, a good vector search mechanism in a RAG pipeline should surface the right chunks at the top positions, and thus achieve a high MRR.
🎯 Average Precision (AP)
In my previous post on binary, order-agnostic retrieval metrics, we looked specifically at Precision@K. In particular, Precision@K indicates how many of the top K retrieved documents are actually relevant. Precision@K can be calculated as:
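Precision@K = (number of relevant chunks in the top K results) / K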

Average Precision (AP) builds on that idea. Specifically, to calculate AP, we iteratively compute Precision@K at each position K where a new relevant chunk appears. Then we obtain AP by simply taking the average of those Precision@K values.
But let's see a concrete example of this calculation. In this example retrieved set, relevant chunks appear at positions K = 1 and K = 4.
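For instance, this could correspond to relevance labels like [1, 0, 0, 1, 0] for the top five retrieved chunks (an assumed set, chosen to be consistent with the calculation below).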

So, we calculate Precision@1 and Precision@4, and take their average. That would be (1/1 + 2/4) / 2 = (1 + 0.5) / 2 = 0.75.
More generally, we can write the AP calculation as follows:
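AP = (1 / R) · Σₖ Precision@k · rel(k), where rel(k) equals 1 if the chunk at position k is relevant and 0 otherwise, and R is the total number of relevant chunks among the top K results.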

AP can also range from 0 to 1. More precisely, the higher the AP score, the better our retrieval system is at placing the relevant documents towards the top. In other words, more relevant documents are retrieved and appear ahead of irrelevant ones.
Unlike MRR, which only focuses on the first relevant result, AP takes into account the positions of all relevant chunks retrieved. In essence, it captures how much or how little irrelevant content we come across while retrieving the truly relevant chunks within the top K.
To get a better grip on AP versus MRR, we can think of them in terms of Spotify playlists. Similar to the Google search example, a high MRR means that the very first song on the playlist is one of our favorites. On the flip side, a high AP means that the playlist as a whole is great, with many of our favorite songs appearing towards the top of the playlist.
So, is our vector search any good?
Normally, I would continue this section with the War and Peace example, as I have done in some of my previous RAG experiments. However, the full retrieval code has become too large to include in every post. Instead, in this post, I will focus on showing how I calculate these metrics in Python, doing my best to keep the examples short.
Anyway! Let's see how MRR and AP can be calculated for the retrieval step of a RAG pipeline in Python. We can define functions for RR and MRR as follows:
from typing import Iterable, Sequence

# Reciprocal Rank (RR)
def reciprocal_rank(relevance: Sequence[int]) -> float:
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

# Mean Reciprocal Rank (MRR)
def mean_reciprocal_rank(all_relevance: Iterable[Sequence[int]]) -> float:
    vals = [reciprocal_rank(r) for r in all_relevance]
    return sum(vals) / len(vals) if vals else 0.0
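As a quick sanity check, here is how these functions behave on a couple of made-up relevance lists (the labels below are hypothetical, purely to illustrate the functions defined above):

# Hypothetical relevance labels for two queries (1 = relevant, 0 = not relevant)
query_a = [0, 0, 1, 0, 0]   # first relevant chunk at position 3 -> RR = 1/3
query_b = [1, 0, 0, 1, 0]   # first relevant chunk at position 1 -> RR = 1.0

print(reciprocal_rank(query_a))                  # 0.3333...
print(mean_reciprocal_rank([query_a, query_b]))  # (1/3 + 1.0) / 2 ≈ 0.6667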
We have already defined Precision@K in the previous post, as follows:
# Precision@k
def precision_at_k(relevance: Sequence[int], k: int) -> float:
    k = min(k, len(relevance))
    if k == 0:
        return 0.0
    return sum(relevance[:k]) / k
Building on that, we can define Average Precision (AP) as follows:
def average_precision(relevance: Sequence[int]) -> float:
    if not relevance:
        return 0.0
    precisions = []
    hit_count = 0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hit_count += 1
            precisions.append(hit_count / i)  # Precision@i
    return sum(precisions) / hit_count if hit_count else 0.0
Each of these functions expects a list of binary relevance labels, where 1 means the retrieved chunk is relevant to the query and 0 means it is not. In practice, these labels are generated by comparing the retrieved results against a ground-truth set, just as we did in Part 1 when we calculated Precision@K and Recall@K. That way, for each question (for example, “Who is Anna Pávlovna?”), we generate a list of binary relevance labels according to whether each retrieved chunk contains the answer text. From there, we can calculate all the metrics using the functions shown above.
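As a small, self-contained sketch of that labeling step (the chunk texts and the answer string below are made up purely for illustration), it could look roughly like this:

# Hypothetical retrieved chunks and expected answer text, for illustration only
retrieved_chunks = [
    "Anna Pávlovna Schérer was a maid of honour at the court.",
    "An unrelated passage about the weather in Petersburg.",
    "Another chunk that does not mention the answer.",
]
answer_text = "maid of honour"

# 1 if the retrieved chunk contains the answer text, 0 otherwise
relevance = [int(answer_text.lower() in chunk.lower()) for chunk in retrieved_chunks]

print(relevance)                     # [1, 0, 0]
print(reciprocal_rank(relevance))    # 1.0
print(average_precision(relevance))  # 1.0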
Another useful order-aware metric we can calculate is Mean Average Precision (MAP). As you can imagine, MAP is the average of the APs over multiple retrieved sets. For example, if we calculate the AP for three different test questions on our RAG pipeline, the MAP score tells us the overall retrieval quality across all of them.
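A MAP function is not included in the snippets above, but given average_precision, a minimal sketch (following the same pattern as mean_reciprocal_rank) could look like this:

# Mean Average Precision (MAP): mean of AP over multiple queries
def mean_average_precision(all_relevance: Iterable[Sequence[int]]) -> float:
    vals = [average_precision(r) for r in all_relevance]
    return sum(vals) / len(vals) if vals else 0.0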
On my mind
The binary, order-agnostic metrics we saw in the first part of this series, such as Hit Rate@K, Precision@K, Recall@K, and F1@K, can provide valuable information for evaluating the retrieval performance of a RAG pipeline. However, such metrics only tell us whether or not the relevant documents exist in the retrieved set.
The binary, order-aware metrics reviewed in this post, such as Mean Reciprocal Rank (MRR) and Average Precision (AP), can give us more insight, as they tell us not only whether the relevant documents are present in the retrieved results, but also at which positions they appear. This way, we can get a better view of how well the retrieval mechanism of our RAG pipeline is performing, depending on the task and the type of documents we are using.
Stay tuned for the next and final part of this retrieval evaluation series, where I'll be discussing further retrieval evaluation metrics for RAG pipelines.
Did you like this post? Let's be friends! Join me at:
📰 Substack 💌 Medium 💼 LinkedIn ☕ Buy me a coffee!
What about pialgorithms?
Want to bring the power of RAG to your organization?
pialgorithms can do it for you 👉 Book a demo today



