Water Cooler Small Talk, Ep. 11: Overfitting in RAG testing



Water cooler small talk is a special type of small talk, often seen in office spaces around the water cooler. There, employees share all kinds of corporate gossip, myths, legends, inaccurate scientific theories, unintelligent personal stories, or outright lies. Anything goes. In my Water Cooler Small Talk posts, I discuss strange and often scientifically invalid ideas that I, my friends, or acquaintances have heard in the office and that have left us speechless.

So, here's a water cooler idea for today's post:

We have built a RAG app that performs very well. We are now in the testing phase, and it is going well because with every test we keep identifying problems and fixing them. We are already at 97%.

Now, I want you to stop for a moment and think about what might be wrong with this statement. 🤔 Because on the surface, it makes perfect sense. Finding problems and fixing them sounds exactly like what a testing process should do, right? It even sounds responsible. So what is really going on here?

The problem here is subtle but fundamental. If you use your testing process to identify problems, fix those problems, and then retest on the same test set, you're unfortunately not really testing anymore. A test set has one main property that makes it useful: the model has never seen it before. Every time you tune the system based on its results and re-test on the same set, you strip away a little more of that property. In other words, the test set has silently become part of the development process and is now effectively a training set.

But doing this correctly is easier said than done. In fact, conducting the evaluation process properly can be quite a chore. In particular, when evaluating a RAG application, where the test set is a set of paired questions and answers rather than a historical dataset, doing it right can be tedious and time-consuming. However, failing to evaluate correctly leads to a well-known ML problem: overfitting.

What about overfitting?

Let's take a step back and make a short detour into the basics of ML.

In machine learning, a model is built using data that is usually split into a training set, a validation set, and a test set. Specifically, the model starts with the training set, which is the data used to decide what kind of model we need and to adjust the model parameters. In its simplest form, the training set consists of pairs of x and y values, and our goal is to come up with a model y = f(x) that best fits the available x and y data.

Once that is done, the trained model is used to predict the results on the validation set. Specifically, for each x in the validation set, we generate the predicted ŷ = f(x) based on the chosen model, check how it compares to the actual y of the validation set, and adjust our model accordingly.

Finally, after deciding which model we want to proceed with based on the validation step, we run it once more, this time on the test set. The goal of the test set is to see how well the final model performs on data it has never seen before by calculating its scores, and that's why a test set should only be used once.
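To make the three roles concrete, here is a minimal sketch of such a split in plain Python. The function name three_way_split, the 70/15/15 proportions, and the toy data are my own assumptions for illustration, not something prescribed by any particular library.

```python
import random

def three_way_split(pairs, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle (x, y) pairs once and cut them into train, validation, and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # set aside, scored only once at the very end
    val = shuffled[n_test:n_test + n_val]    # used repeatedly to compare and adjust models
    train = shuffled[n_test + n_val:]        # used to fit the model parameters
    return train, val, test

# Toy data following a hypothetical pattern y = 2x + 1
pairs = [(x, 2 * x + 1) for x in range(100)]
train, val, test = three_way_split(pairs)
print(len(train), len(val), len(test))
```

The important part is not the code but the discipline around it: the validation list is the one you are allowed to look at repeatedly, while the test list stays out of every decision until the end.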

We do all this because our goal is not to fit the training set itself, but rather what the training set represents. In this way, we can create models that learn the underlying patterns well enough to make accurate predictions on new, unseen data (the test set).

Unfortunately, we sometimes fail to do so, and instead of creating models that fit the general case, we create models that simply fit the narrow training set without generalizing. This is what we call overfitting. As a result, the model performs amazingly well on the training set, but poorly on anything new.
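A toy illustration of that gap, under assumptions of my own: the "true" pattern is y = 2x + 1 with a bit of noise, the overfit model is a lookup table that memorizes the training pairs, and the comparison model is a simple least-squares line. None of this comes from a real project; it is only meant to show the shape of the problem.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Draw n noisy samples from the hypothetical true pattern y = 2x + 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + 1 + rng.gauss(0, 0.5)))
    return data

def mse(model, data):
    """Mean squared error of a model (a function of x) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train_set = make_data(50)
test_set = make_data(50)

# Overfit model: memorizes every training pair, falls back to the mean for unseen x
lookup = dict(train_set)
mean_y = sum(y for _, y in train_set) / len(train_set)

def memorizer(x):
    return lookup.get(x, mean_y)

# Simple model: least-squares line fitted on the training set
mean_x = sum(x for x, _ in train_set) / len(train_set)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train_set)
         / sum((x - mean_x) ** 2 for x, _ in train_set))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name, "train MSE:", round(mse(model, train_set), 3),
          "test MSE:", round(mse(model, test_set), 3))
```

The memorizer gets a perfect score on the data it has seen and a much worse one on new data, while the line has a similar, modest error on both.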

The catch here is that the test set only makes sense if the model has never seen it before. The moment you use it to make a decision about the model, even a small one, it has stopped being a test set and has effectively become part of the training set.

But after this small detour into the basics of ML, let's go back to our original water cooler idea.

Overfitting in RAG testing

This is where things get especially relevant for those of us who build and test AI applications.

In my RAG pipeline testing series, we talked a lot about retrieval metrics: Precision@k, Recall@k, MRR, NDCG@k, and so on. Still, all those great metrics are only as useful as the test sets they are computed on. It turns out that the line between training and test sets in RAG can blur surprisingly easily. I attribute part of this to the fact that, unlike a simple regression model, AI models and RAG pipelines are far from intuitive. We have little real sense of how closely the system has been fitted to the data, and as a result, we may get carried away and tune the system on the test set without even realizing that we have done so.
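As a quick reminder of what some of these metrics compute for a single query, here is a minimal sketch in plain Python. The function names and the document ids are hypothetical, and binary relevance (a document is either relevant or not) is assumed.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

# Hypothetical retrieval result for one test question
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))
```

Whatever the formula, each number is an average over the questions in the test set, so it says something about real-world performance only as long as those questions stay independent of the tuning process.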

The team in our water cooler story did exactly this. They identified problems during testing, fixed them, and re-evaluated on the same question-answer pairs. Naturally, with every iteration the evaluation score improves, because they are essentially fitting the AI application to the test set.

In particular, here are the most common ways this happens in RAG:

Tuning prompts on the test set: This is probably the most common pattern, and it is exactly what happened in our water cooler story. During evaluation, you notice that certain types of queries keep failing, and you adjust your system prompt or retrieval logic to fix them. Then you test again on the same set. Yes, the score improves; you might even be able to reach 100%.
Cherry-picking questions the system already handles well: This is a subtler version of the same problem. When building a test set, it's tempting to include examples that you already know the system handles well, especially ones you've tried informally. Over time, the test set drifts toward the system's strengths and away from its blind spots. The metrics look good, but in reality, no one knows what the real performance is.
Writing your test questions from the same documents you indexed: If the questions in your test set are written while looking closely at the documents in your knowledge base, they are likely to be shaped by what you already know can be retrieved. In other words, the questions are not truly independent of the data, but again, this is very hard to spot since we are talking about questions and answers in natural language rather than just x and y numbers.
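The first pattern can be mimicked with a small simulation. Everything in it is a deliberately extreme assumption of mine: the "system" fails on a random 30% of questions, and each "fix" only helps the exact questions it was written for. Real fixes usually generalize at least partly, so treat this as a caricature of the mechanism rather than a measurement.

```python
import random

rng = random.Random(42)

def make_questions(n, tag):
    """Hypothetical questions; 'hard' marks those the base system gets wrong."""
    return [{"id": f"{tag}-{i}", "hard": rng.random() < 0.3} for i in range(n)]

def is_correct(question, patches):
    """The toy system succeeds on easy questions and on explicitly patched ones."""
    return (not question["hard"]) or question["id"] in patches

def score(questions, patches):
    return sum(is_correct(q, patches) for q in questions) / len(questions)

reused_set = make_questions(100, "reused")  # the set we keep re-testing on
fresh_set = make_questions(100, "fresh")    # questions never used for tuning

patches = set()
baseline = score(reused_set, patches)
history = []
for _ in range(5):
    # Look at failures on the reused set and "fix" up to ten of them per round
    failures = [q for q in reused_set if not is_correct(q, patches)]
    for q in failures[:10]:
        patches.add(q["id"])
    history.append((score(reused_set, patches), score(fresh_set, patches)))

for reused_score, fresh_score in history:
    print(f"reused set: {reused_score:.2f}   fresh set: {fresh_score:.2f}")
```

The score on the reused set climbs with every round, while the score on fresh questions does not move at all, which is the number the users of the app will actually experience.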

The simple but difficult fix for all of these situations is the same as the classic machine learning one: keep a truly held-out test set that you touch as little as possible, write your questions independently of the system's known behavior, and treat good metrics with skepticism. A RAG system that performs well on a small, carefully selected, repeatedly reused test set is very similar to a student who has memorized past exam papers but is completely unprepared for the first real question that doesn't look exactly like the ones they have already seen.
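One possible way to keep such a held-out set honest is to assign every question to a split deterministically, so a question can never quietly migrate from the locked set into the one you tune on. The sketch below is just one option; the function name assign_split, the 30% holdout fraction, and the example questions are assumptions of mine.

```python
import hashlib

def assign_split(question, holdout_frac=0.3):
    """Deterministically map a question to 'dev' (tune freely) or 'holdout' (locked)."""
    normalized = question.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # stable value in [0, 1)
    return "holdout" if bucket < holdout_frac else "dev"

# Hypothetical question-answer pairs
qa_pairs = [
    ("What is the refund policy?", "30 days"),
    ("Who approves travel expenses?", "The line manager"),
    ("How do I reset my password?", "Through the self-service portal"),
]

dev = [pair for pair in qa_pairs if assign_split(pair[0]) == "dev"]
holdout = [pair for pair in qa_pairs if assign_split(pair[0]) == "holdout"]
print(len(dev), "dev questions,", len(holdout), "holdout questions")
```

Because the assignment depends only on the question text, rerunning the script or adding new questions never reshuffles the existing ones. The dev part is for finding and fixing problems; the holdout part is scored rarely, and its individual failures are ideally not inspected at all.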

If you want to sanity-check your RAG test setup, here is a short list of questions to think about and answer honestly:

When I created my test set, did I write the questions independently of the texts in my knowledge base, or did I look at the texts first and write questions that I already knew could be answered?
Have I ever dropped or changed a question in my test set because the app kept failing on it?
Do I know roughly how my system performs on questions it has never been tested on before, or only on the same fixed set that I keep reusing?
Is there a part of my test set that has been sitting untouched and out of sight for a long time?

If you answered no to that last one, you may already be part of today's water cooler story. 😉

Overfitting in real life: Goodhart's Law

Goodhart's Law, named after economist Charles Goodhart, who formulated it in 1975, is an adage that goes like this:

When a measure becomes a target, it ceases to be a good measure.

This idea originally came from monetary policy, but it applies well beyond economics and shows up almost everywhere a number is used to judge performance, such as KPIs, budgets, and all kinds of scores. Think of a car salesman who is rewarded for the number of cars he sells each month and starts selling more cars, even at a loss; hospitals that try to reduce patients' length of stay and end up discharging patients too early; citation counts in scientific publications being gamed; and so on.

All of these examples work in the same basic way: a metric is introduced to keep track of something important. For a while, the metric and the real thing go hand in hand, and it feels like we can trust the evolution of the metric to track the evolution of the real thing. Then people (or systems) start working directly on the metric instead of the important thing, and the two quietly drift apart. The metric keeps improving, while the important thing it was meant to represent no longer improves along with it.

In AI specifically, this failure mode is called reward hacking, which occurs when an AI system optimizes a poorly specified reward without achieving the intended outcome. Similarly, in classic ML, overfitting is what happens to a model when the training signal stops representing the true underlying pattern. Goodhart's Law is what happens to us, the people designing the system, when our test signal stops representing what we really care about.

On my mind

What I find most interesting about overfitting, especially in RAG systems, is that it's not really a technical problem. It is mainly a problem of understanding the evaluation process and sticking to it. It's tempting to compromise that process and optimize the scores directly, especially with RAG test sets that don't look like the datasets we're used to in classical ML.

However, this pattern extends beyond machine learning and AI. In real life and in machine learning alike, the solution is the same: stay consistent and never lose sight of the real thing you're trying to achieve. In ML and AI, that goal is for the model to work reliably and produce meaningful results when it is deployed and confronted with real-world data, not just to achieve high scores during testing.

The team at our water cooler is not doing anything that looks wrong. On the contrary, what they are doing feels responsible: they are tuning the application based on the test results. And that's exactly what makes overfitting so dangerous. It doesn't look like a mistake while it is happening. It only looks like one in retrospect, when the system meets the real world and the results stop holding.

✨ Thanks for reading! ✨

If you've made it this far, you may find pialgorithms useful, a platform we've been building that helps teams securely manage organizational information in one place.

Did you like this post? Join me on 💌Substack and 💼LinkedIn

All photos by the author, unless otherwise noted


