RAG with Hybrid Search: How Does Keyword Search Work?

I have written a lot about Retrieval Augmented Generation (RAG) before. In particular, I have covered the basics of the RAG approach, as well as a number of relevant concepts, such as chunking, embeddings, reranking, and retrieval evaluation.
The traditional RAG approach is very useful because it allows us to search for relevant chunks of text in a large knowledge base based on the semantic meaning of the text, rather than on exact words. In this way, it lets us use the power of LLMs over our own custom documents. Ironically, as useful as this similarity search is, it sometimes fails to retrieve existing chunks of text that closely match the user's query. In particular, when searching in a large knowledge base, specific keywords (such as certain technical terms or names) may get lost, and relevant chunks may not be returned even when the user's query contains those exact words.
Fortunately, this issue can be easily handled by using an old-fashioned keyword search method, such as BM25 (Best Match 25). Then, by combining the results of the similarity search and the BM25 search, we can get the best of both worlds and significantly improve the results of our RAG pipeline.
. . .
In information retrieval systems, BM25 is a ranking function used to evaluate how relevant a document is to a search query. Unlike similarity search, BM25 evaluates the relevance of a document to a user's query not based on the semantic meaning of the document, but rather on the actual words it contains. Specifically, BM25 is a bag-of-words (BoW) model, which means that it does not look at the order of the words in a document (which is where semantic meaning emerges), but rather at the frequency with which each word appears in it.
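To make the bag-of-words idea concrete, here is a minimal sketch of counting term frequencies f(t, d). The `term_frequencies` function and the simple whitespace tokenization are illustrative simplifications, not what any particular BM25 library does internally:

```python
from collections import Counter

def term_frequencies(document: str) -> Counter:
    """Bag-of-words view of a document: word order is discarded,
    only the count of each word is kept."""
    tokens = document.lower().split()
    return Counter(tokens)

counts = term_frequencies("time travel is risky but time travel is fun")
print(counts["time"])    # 2 -> f('time', d) = 2
print(counts["fun"])     # 1
```

Note that `counts["time"]` and `counts["travel"]` are equal here even though the words appear in different positions: position carries no weight in a bag-of-words model.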
The BM25 score of a document d for a given query q containing terms t can be calculated as follows:

score(d, q) = Σ_{t ∈ q} IDF(t) · [ f(t, d) · (k₁ + 1) ] / [ f(t, d) + k₁ · (1 − b + b · |d| / avg(dl)) ]

where f(t, d) is the number of times the term t appears in document d, |d| is the document length, and avg(dl) is the average document length in the knowledge base.
As intimidating as this expression may look, let's step back and build it up piece by piece.
. . .
Starting easy with TF-IDF
The basic concept behind BM25 is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is a fundamental information retrieval concept that aims to measure how important a word is to a particular document in the knowledge base. In other words, it measures how many documents of the knowledge base a word appears in, revealing in this way how specific and informative that word is about a given document. The rarer a word is in the knowledge base, the more informative it is about the particular documents that contain it.
More formally, for a document d in the knowledge base and a term t, the Term Frequency TF(t, d) can be defined as:

TF(t, d) = f(t, d) / |d|

where f(t, d) is the number of times the term t appears in document d, and |d| is the length of the document.
and the Inverse Document Frequency IDF(t) as:

IDF(t) = log( N / n(t) )

where N is the total number of documents in the knowledge base and n(t) is the number of documents containing the term t.
Then, the TF-IDF score can be calculated as the product of TF and IDF as follows:

TF-IDF(t, d) = TF(t, d) · IDF(t)
. . .
Let's go through a quick example to better grasp TF-IDF. Imagine a small knowledge base containing three movies with the following descriptions:
- “A sci-fi thriller about time travel and a perilous journey into alternate realities.”
- “A romantic drama about two strangers who fall in love during an unexpected trip.”
- “A sci-fi adventure featuring an alien explorer forced to travel across the galaxies.”
After removing the stop words, we can consider the following terms for each document:
- document 1: sci-fi, thriller, time, travel, perilous, journey, alternate, realities
- document 1 length, |d1| = 8
- document 2: romantic, drama, two, strangers, fall, love, unexpected, time, travel
- document 2 length, |d2| = 9
- document 3: sci-fi, adventure, feature, alien, explorer, forced, travel, galaxies
- document 3 length, |d3| = 8
- the total number of documents in the knowledge base N = 3
Then we can calculate f(t,d) for each term in each document:

Next, for each term, we calculate the Document Frequency n(t) and the Inverse Document Frequency IDF(t):

Then finally we calculate the TF-IDF score for each term.

So, what do we get from all this? Let's look, for example, at the TF-IDF scores for document 1. The word 'travel' is completely uninformative, as it is included in all the documents of the knowledge base. On the other hand, words like 'thriller' and 'perilous' are very informative about document 1 specifically, as they are included only in it.
In this way, the TF-IDF score provides a simple and straightforward way to identify and rate the importance of the terms in each document of the knowledge base. To put it differently, the higher the TF-IDF scores of a document's terms, the more distinctive the information in this document is compared to the information contained in all the other documents of the knowledge base.
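The whole worked example can be reproduced in a few lines of Python. This is a from-scratch sketch, assuming the tokenization of the three movie descriptions shown above (real systems would also apply stemming and more careful tokenization):

```python
import math

# Tokenized movie descriptions (stop words already removed)
docs = {
    "d1": ["sci-fi", "thriller", "time", "travel", "perilous", "journey", "alternate", "realities"],
    "d2": ["romantic", "drama", "two", "strangers", "fall", "love", "unexpected", "time", "travel"],
    "d3": ["sci-fi", "adventure", "feature", "alien", "explorer", "forced", "travel", "galaxies"],
}
N = len(docs)  # total number of documents in the knowledge base

def tf(term, doc):
    # Term frequency: raw count normalized by document length
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / number of docs containing the term)
    n_t = sum(1 for doc in docs.values() if term in doc)
    return math.log(N / n_t)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# 'travel' appears in every document, so its IDF (and TF-IDF) is zero
print(tf_idf("travel", docs["d1"]))                 # 0.0
# 'thriller' appears only in d1, so it is highly informative about d1
print(round(tf_idf("thriller", docs["d1"]), 3))     # 0.137
```

As expected, the ubiquitous term scores exactly zero, while the term unique to document 1 gets the highest weight.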
. . .
Understanding the BM25 score
In BM25, we use the concept of TF-IDF to measure how informative (that is, how rare and therefore how important) each document of the knowledge base is with respect to a particular query. To do this, when calculating BM25, we only pay attention to the terms of each document that are contained in the user's query, and perform calculations similar to TF-IDF.
BM25 uses the concept of TF-IDF, but with a few mathematical tweaks that address the two main weaknesses of TF-IDF.
. . .
The first pain point of TF-IDF is that TF is linear in the number of times a term t appears in document d, f(t, d); that is, it is a function of the form:

TF(t, d) ∝ f(t, d)

This means that the more times the term t appears in document d, the larger the TF gets, proportionally. As you can imagine, this can be a problem in large documents, where a word may appear over and over again without being particularly important for the document.
An easy way to solve this is to use a saturation curve instead of a linear function. This means that the output increases with the input but gradually approaches a maximum value, unlike a linear function, where the output increases with the input indefinitely.
Therefore, we can rewrite the TF in this form as follows, introducing the parameter k₁, which controls the frequency scaling:

TF(t, d) = f(t, d) / ( f(t, d) + k₁ )

In this way, the parameter k₁ introduces diminishing returns. That is, the first appearance of the term t in a document has a significant impact on the TF score, while the 20th appearance adds only a small additional benefit.
However, this results in values in the range 0 to 1. We can modify this a bit and add (k₁ + 1) to the numerator, so that the resulting TF values remain comparable to the original TF definition used in TF-IDF:

TF(t, d) = [ f(t, d) · (k₁ + 1) ] / [ f(t, d) + k₁ ]
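We can see the diminishing returns numerically with a quick sketch. The function name and the choice k₁ = 1.5 are illustrative (1.5 simply lies in the typical range used in practice):

```python
def tf_saturated(f: float, k1: float = 1.5) -> float:
    """Saturating term frequency: f * (k1 + 1) / (f + k1)."""
    return f * (k1 + 1) / (f + k1)

# The first occurrence of a term matters far more than the 20th:
print(round(tf_saturated(1) - tf_saturated(0), 3))    # 1.0
print(round(tf_saturated(20) - tf_saturated(19), 3))  # 0.009
```

Going from 0 to 1 occurrence moves the score by a full point, while going from 19 to 20 occurrences barely moves it at all; the curve flattens toward its maximum of k₁ + 1.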
. . .
So far, so good, but one important piece of information is still missing from this expression: the document length |d|, which was included in the initial definition of TF. Before adding |d| back, though, we also need to adjust it a bit, as it is the second pain point of the original TF-IDF formulation. Specifically, the problem is that the knowledge base will contain documents of varying lengths |d|, resulting in scores that are not comparable across documents. BM25 solves this by normalizing |d|. That is, instead of |d|, the following expression is used:

1 − b + b · |d| / avg(dl)

where avg(dl) is the average document length in the knowledge base. Additionally, b is a parameter in [0, 1] which controls the length normalization, with b = 0 corresponding to no normalization and b = 1 corresponding to full normalization.
Therefore, by plugging the normalization expression in for |d|, we get the TF version used in BM25, which looks like this:

TF(t, d) = [ f(t, d) · (k₁ + 1) ] / [ f(t, d) + k₁ · (1 − b + b · |d| / avg(dl)) ]
Generally, the parameter values used are k₁ ≈ 1.2 to 2.0 and b ≈ 0.75.
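Here is a small sketch of the full BM25 term-frequency component, using the typical parameter values mentioned above (the function name and the example lengths are made up for illustration):

```python
def tf_bm25(f: float, doc_len: int, avg_len: float, k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 term-frequency component: saturation plus length normalization."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_len))

# Same raw count f = 2, but different document lengths (average length 100):
print(round(tf_bm25(2, doc_len=50, avg_len=100), 3))   # 1.702 -> short doc is rewarded
print(round(tf_bm25(2, doc_len=200, avg_len=100), 3))  # 1.081 -> long doc is penalized
```

With b = 0.75, a term occurring twice counts for more in a document half the average length than in one twice the average length, which is exactly the effect the normalization is after.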
. . .
BM25 also uses a slightly modified expression for the IDF calculation:

IDF(t) = log( (N − n(t) + 0.5) / (n(t) + 0.5) + 1 )

where, as before, N is the total number of documents in the knowledge base and n(t) is the number of documents containing the term t.
This expression is obtained by asking a better question. In the original formulation of IDF, we ask:
“How rare is this term?”
Instead, when calculating the IDF of BM25, we ask:
“How much more likely is this term to appear in relevant documents than in irrelevant ones?”
The probability of a document containing the term t, in a knowledge base of N documents, can be expressed as:

P(t) = n(t) / N
We can then express the odds of a document containing the term t versus not containing it as:

n(t) / ( N − n(t) )
Then, taking the reciprocal, we end up with:

( N − n(t) ) / n(t)
Similar to the plain IDF, we take the log of this expression to suppress extreme values. A small smoothing adjustment, known as Robertson–Spärck Jones smoothing, is also applied by adding 0.5 to the numerator and the denominator, and in this way we finally arrive at the IDF expression used in BM25.
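The resulting BM25 IDF is easy to sketch; the function name and the 1000-document corpus below are made up for illustration:

```python
import math

def idf_bm25(N: int, n_t: int) -> float:
    """BM25 IDF: log of the smoothed odds of a document NOT containing the term."""
    return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

# In a hypothetical 1000-document corpus:
print(round(idf_bm25(1000, 5), 3))    # 5.204 -> rare term, highly informative
print(round(idf_bm25(1000, 900), 3))  # 0.106 -> common term, barely informative
```

A term appearing in only 5 of 1000 documents is weighted roughly fifty times more heavily than one appearing in 900 of them.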
. . .
Finally, we can calculate the BM25 score of a given document d for a given query q containing multiple terms t:

score(d, q) = Σ_{t ∈ q} IDF(t) · [ f(t, d) · (k₁ + 1) ] / [ f(t, d) + k₁ · (1 − b + b · |d| / avg(dl)) ]
In this way, we can score the documents found in the knowledge base based on their relevance to a particular query, and then extract the most relevant documents.
All this is to say that the BM25 score is something like a TF-IDF score on steroids: simple to grasp conceptually, but carefully refined. This is why BM25 is so popular for keyword search, and it is also what we will use for the keyword search in our RAG pipeline.
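Putting the pieces together, here is a from-scratch BM25 scorer applied to the movie corpus from the TF-IDF example. This is a didactic sketch, not a production implementation (libraries handle tokenization, stemming, and indexing far more carefully):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)  # document frequency of t
        if n_t == 0:
            continue  # term absent from the whole corpus contributes nothing
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f = doc.count(t)
        tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * tf
    return score

corpus = [
    ["sci-fi", "thriller", "time", "travel", "perilous", "journey", "alternate", "realities"],
    ["romantic", "drama", "two", "strangers", "fall", "love", "unexpected", "time", "travel"],
    ["sci-fi", "adventure", "feature", "alien", "explorer", "forced", "travel", "galaxies"],
]
query = ["sci-fi", "thriller"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
best = scores.index(max(scores))
print(best)  # 0 -> the sci-fi thriller description ranks first
```

The first document contains both query terms (including the rare 'thriller'), so it wins; the romantic drama contains neither and scores exactly zero.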
RAG with Hybrid search
So, now that we have an idea of how BM25 works and how it scores the documents of the knowledge base given a set of keywords, we can move on to see how BM25 fits into the traditional RAG pipeline.
As discussed in many of my previous posts, a simple RAG pipeline would look like this:

Such a pipeline uses similarity scores (such as the cosine similarity) of embeddings to search for, find, and return chunks that match the semantics of the user's query. Although similarity search is very useful, sometimes matches can be missed. Therefore, by adding keyword search on top of the similarity search in the RAG pipeline, we can identify the relevant chunks more effectively and completely. This changes our pipeline as follows:

For each chunk of text, in addition to its embedding, we now also build a BM25 index, which allows the quick calculation of BM25 scores for any incoming user query. In this way, for each user query, we can identify the chunks with the highest BM25 scores – that is, chunks containing rare words of the user's query, which are more informative about it compared to all the other chunks of the knowledge base.
Note how we now match the user query against both the embeddings vector store (semantic search) and the BM25 index (keyword search). Different chunks are retrieved by the semantic search and the keyword search – the retrieved chunks are then combined, deduplicated, and ranked using rank fusion.
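One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF), which only needs the rank of each chunk in each list, not the raw scores. Here is a sketch; the chunk IDs and the two rankings are hypothetical, and k = 60 is the value commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists: each doc accumulates 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical chunk IDs returned by each retriever, best first
semantic_ranking = ["chunk_a", "chunk_b", "chunk_c"]   # embedding similarity search
keyword_ranking  = ["chunk_b", "chunk_d", "chunk_a"]   # BM25 keyword search

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused[0])  # chunk_b -> ranks near the top of both lists
```

Chunks that appear high in both rankings (like chunk_b) float to the top, while chunks found by only one retriever are still kept, just lower down.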
. . .
On my mind
Integrating BM25 keyword search into the RAG pipeline allows us to get the best of both worlds: the semantic understanding of embeddings and the precision of exact keyword matching. By combining these methods, we can retrieve the most relevant chunks even from large knowledge bases with greater reliability, ensuring that critical terms, technical phrases, or rare words are not overlooked. In this way, we can greatly improve the effectiveness of our retrieval process and ensure that no important information is left behind.
Did you like this post? Let's be friends! Join me at:
📰 Substack 💌 Medium 💼 LinkedIn ☕ Buy me a coffee



