Understanding Context and Contextual Retrieval in RAG

In my previous post, I explained how hybrid search can be used to improve the effectiveness of a RAG pipeline. RAG, in its basic form, relies solely on semantic search over embeddings; this can be very effective, allowing us to apply the power of AI to our own documents. However, semantic search, as powerful as it is, can sometimes miss an exact match to a user's query when applied to a large knowledge base, even if that match exists in the documents. This weakness of traditional RAG can be addressed by adding a keyword search component, such as BM25, to the pipeline. In this way, hybrid search, combining semantic and keyword search, produces more comprehensive results and greatly improves the performance of a RAG system.
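To make this concrete, here is a minimal sketch of hybrid search at query time: dense retrieval and BM25 are run separately and then merged with reciprocal rank fusion. The library choices (sentence-transformers, rank_bm25), the toy corpus, and the fusion constant are my own illustrative assumptions, not a prescribed implementation.

```python
# Minimal hybrid search sketch: dense retrieval plus BM25, merged with
# reciprocal rank fusion (RRF). Libraries and corpus are illustrative only.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

chunks = [
    "Heat the mixture gently and stir occasionally.",
    "Profit increased by 6% compared to the previous quarter.",
    "Add the garlic and simmer the tomato sauce for 20 minutes.",
]

# Dense index: embed every chunk once.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

# Sparse index: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_search(query: str, k: int = 3, rrf_k: int = 60) -> list[str]:
    # Rank chunks by cosine similarity of embeddings.
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-chunk_embeddings @ q_emb)

    # Rank chunks by BM25 keyword score.
    sparse_rank = np.argsort(-bm25.get_scores(query.lower().split()))

    # Reciprocal rank fusion: a chunk scores well if either ranking likes it.
    scores = {}
    for ranking in (dense_rank, sparse_rank):
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (rrf_k + rank + 1)

    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [chunks[i] for i in best]

print(hybrid_search("how long to simmer the sauce?"))
```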
However, even when we use hybrid search in a RAG pipeline, we can still miss important information that is scattered across different parts of a document. This happens because, when a document is split into chunks of text, the context (the surrounding text that gives each chunk part of its meaning) is sometimes lost. This is especially likely in complex texts whose meaning is connected across several pages and cannot be fully contained within a single paragraph. Consider, for example, a table or figure referenced in multiple sections of the text without clearly specifying which table is meant (e.g., "as shown in the table, profit increased by 6%", but which table?). As a result, when chunks are retrieved, they are stripped of their context, which sometimes leads to irrelevant chunks being returned and irrelevant answers being produced.
This loss of context has long been a major problem in RAG systems, and several solutions have been tried with limited success. An obvious attempt is to increase the chunk size, but this often dilutes the semantic meaning of each chunk and ends up making retrieval less accurate. Another option is to increase chunk overlap. Although this helps carry more context across chunk boundaries, it also increases storage and computation costs. Most importantly, it does not fully solve the problem: important links can still exist between parts of the document that are far apart, not just across adjacent chunk boundaries. More advanced approaches to this challenge include Hypothetical Document Embeddings (HyDE) and document summary indexing, but these still fail to provide significant improvements.
Finally, a method that effectively solves this problem and greatly improves RAG retrieval results, contextual retrieval, was originally introduced by Anthropic in 2024. Contextual retrieval aims to solve context loss by preserving the context of each chunk and, in doing so, improving the accuracy of the retrieval step of the RAG pipeline.
. . .
So, what is context?
Before saying anything about contextual retrieval, let's step back and talk a little about what context actually is. Sure, we've all heard of LLM context windows, but what are they really about?
More precisely, context refers to all the tokens available to the LLM, based on which it predicts the next token (remember, LLMs generate text by predicting it one token at a time). That includes the user's message, the system prompt, instructions, tool definitions, and any other guidance that influences how the model responds. Importantly, the part of the answer the model has generated so far is also part of the context, because each new token is generated based on everything that came before it.
Obviously, different contexts lead to very different model outputs. For example:
- 'I went to the restaurant and ordered a ___' will most likely be completed with 'pizza.'
- 'I went to the pharmacy and bought some ___' will most likely be completed with 'medicine.'
The fundamental limitation of LLMs is their context window. The context window is the maximum number of tokens that can be passed to the model at once and considered when producing a single response. Different LLMs have larger or smaller context windows: modern frontier models can handle hundreds of thousands of tokens in a single request, while earlier models often had context windows as small as 8k tokens.
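If you have never looked at tokens directly, the short snippet below makes the unit concrete. It uses tiktoken, OpenAI's open-source tokenizer, purely for illustration; other model families use different tokenizers, so exact counts will differ.

```python
# Count how many tokens a piece of text consumes, to make "context window size"
# concrete. tiktoken is OpenAI's tokenizer; other models tokenize differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "I went to the restaurant and ordered a pizza."
tokens = enc.encode(text)

print(len(tokens))         # number of tokens this sentence occupies in the window
print(enc.decode(tokens))  # decoding round-trips back to the original text
```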
In a perfect world, we would simply pass all the information an LLM needs to know in its context, and we would probably get very good responses. And this is true to some extent: a frontier model like Opus 4.6 with a 200k-token context window can fit roughly 500-600 pages of text. If all the information we need to provide fits within this limit, we can just pass everything as-is into the LLM and get a good response.
The problem is that most real-world AI implementations rely on a knowledge base that is larger than this limit; think, for example, of legal libraries or technical equipment manuals. Since models have these context window limitations, we unfortunately cannot just pass everything to the LLM and let it magically respond; we have to choose the most important information to include in our limited context window. And that is really what RAG is all about: selecting the right information from a large knowledge base to effectively answer a user's question. Ultimately, this turns out to be an engineering problem, often called context engineering: identifying the right information to put into the limited context window in order to generate the best possible responses.
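As a toy illustration of that selection step, the sketch below greedily packs already-ranked chunks into a fixed token budget before building the final prompt. The helper count_tokens is assumed to exist (for example, the tiktoken snippet above wrapped in a function), and the budget value and prompt wording are placeholders.

```python
# Toy "context engineering" step: fit as many relevant chunks as possible into
# a fixed token budget, then assemble the prompt that will be sent to the LLM.
def build_prompt(question: str, ranked_chunks: list[str],
                 count_tokens, budget: int = 8000) -> str:
    selected, used = [], 0
    for chunk in ranked_chunks:          # chunks are assumed sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > budget:
            break                        # the window is full; stop adding chunks
        selected.append(chunk)
        used += cost

    context = "\n\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```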
This is a crucial part of any RAG pipeline: making sure that the right information is retrieved and passed as input to the LLM. This can be done through semantic search and keyword search, as already explained. However, even if retrieval brings back all the semantically similar chunks together with all the exact keyword matches, there is still a good chance that other important information will be left out.
But what kind of information is that? Given that we have already covered semantic similarity and exact keyword matches, what other type of information is left to consider?
Different texts with entirely different meanings may naturally contain identical or near-identical passages. Think of a recipe book and a chemical processing manual that both instruct the reader to 'Heat the mixture gently'. The semantic meaning and the actual words of such a passage are not just similar; they are the same. In this example, what determines the real meaning of the text, and allows us to distinguish between cooking and chemical engineering, is what we call context.

So, this is the kind of additional information we want to preserve. And this is exactly what contextual retrieval does: it preserves the context, the surrounding meaning, of each chunk of text.
. . .
So, what is contextual retrieval?
Contextual retrieval is a technique used in RAG pipelines that aims to preserve the context of each chunk. This way, when a chunk is retrieved and passed to the LLM as input, it keeps as much of its original meaning as possible: semantics, keywords, context, everything.
To achieve this, contextual retrieval suggests that we first generate a short helper text for each chunk, a context text, that situates the chunk within the original document it comes from. In practice, we ask an LLM to produce this contextual text for every chunk: we provide the full document along with the chunk in a single request and instruct it to "provide a brief context that situates this chunk within the document". For a chunk from, say, an Italian cookbook, the prompt would look something like this:
{the full Italian Cookbook document the chunk comes from}

Here is the chunk we want to situate within the full document:

{the actual chunk}

Provide a brief context that situates this chunk within the overall
document to improve search retrieval. Respond only with the concise
context and nothing else.
The LLM returns a short context text, which we then combine with the original chunk. This way, for each chunk of our original document, we generate a contextual text that explains how that chunk fits into its parent document. For our example, the result would look something like this:
Context: Recipe step for simmering homemade tomato pasta sauce.
Chunk: Heat the mixture slowly and stir occasionally to prevent it from sticking.
Which is very informative and straightforward! Now there is no doubt about what this mysterious mixture is, because all the information needed to tell whether we are talking about tomato sauce or a laboratory starch solution sits right inside the chunk itself.
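Here is a minimal sketch of what this step could look like at ingestion time, using the Anthropic Python SDK. The model name, the exact prompt wording, and the way context and chunk are concatenated are all illustrative assumptions rather than a fixed recipe.

```python
# Contextualization step at ingestion time: for every chunk, ask an LLM to
# situate it within the full document, then prepend the answer to the chunk.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """\
{document}

Here is the chunk we want to situate within the full document:

{chunk}

Provide a brief context that situates this chunk within the overall document
to improve search retrieval. Respond only with the concise context and nothing else."""

def contextualize(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # placeholder: any cheap, capable model
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    context = response.content[0].text.strip()
    # The context text and the original chunk now travel together as one unit.
    return f"Context: {context}\n\nChunk: {chunk}"
```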
From this point on, we treat the original chunk text and its context text as an inseparable pair. The remaining steps of a RAG pipeline with hybrid search are then performed exactly as before. That is, for each chunk we create an embedding stored in the vector index and an entry in the BM25 index, both computed on the chunk prefixed with its context text.
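Continuing the sketch above, ingestion could then look like the following, reusing the same illustrative libraries as in the hybrid search example. At query time nothing changes; the same hybrid search simply runs against these contextualized indexes.

```python
# Index the contextualized chunks: both the embedding and the BM25 entry are
# computed on the context text plus the original chunk, never on the bare chunk.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def build_contextual_indexes(document: str, chunks: list[str]):
    # Prepend the LLM-generated context to every chunk (see contextualize above).
    contextualized = [contextualize(document, chunk) for chunk in chunks]

    # Dense index over the contextualized text.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(contextualized, normalize_embeddings=True)

    # Sparse BM25 index over the same contextualized text.
    bm25 = BM25Okapi([c.lower().split() for c in contextualized])

    return contextualized, embeddings, bm25
```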

This method, as simple as it is, results in a dramatic improvement in the retrieval performance of RAG pipelines. According to Anthropic, contextual retrieval reduces the retrieval failure rate by 35% (and by 49% when contextual embeddings are combined with contextual BM25).
. . .
Cost reduction with prompt caching
I hear you ask, "But won't this cost a lot of money?" Surprisingly, no.
Understandably, one would expect this setup to significantly increase the ingestion cost of the RAG pipeline, basically doubling it, if not more. After all, we have now added a bunch of extra LLM calls, haven't we? This is true to some extent: for each chunk, we now make one additional LLM call to situate it within its source document and obtain the context text.
However, this is a cost we pay only once, at the document ingestion stage. Unlike techniques that try to recover context at query time, such as Hypothetical Document Embeddings (HyDE), contextual retrieval does the hard work during the document ingestion phase. Query-time methods require additional LLM calls for every single user query, which quickly adds latency and cost overhead. In contrast, contextual retrieval shifts the computation to the ingestion phase, which means the improved retrieval quality comes with no additional overhead at query time. On top of this, further techniques can be used to reduce the total ingestion cost. Specifically, prompt caching can be used so that the full document is processed only once: it is loaded into the cache, and each chunk is then contextualized against the cached document instead of resending the entire document with every request.
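As a rough sketch, the contextualization call from earlier could use Anthropic's prompt caching like this: the cache_control block marks the document as the reusable prefix, while the per-chunk instruction stays small. The model name and wording are again placeholders, and exact cache pricing and lifetime depend on the provider.

```python
# Same contextualization call, but with prompt caching: the full document is
# marked as cacheable, so it is processed once and subsequent per-chunk calls
# reuse the cached prefix instead of paying for the whole document again.
import anthropic

client = anthropic.Anthropic()

def contextualize_cached(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # placeholder model name
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {   # large, shared part of the prompt: cache it across chunks
                    "type": "text",
                    "text": document,
                    "cache_control": {"type": "ephemeral"},
                },
                {   # small, per-chunk part: changes on every call
                    "type": "text",
                    "text": (
                        "Here is the chunk we want to situate within the full "
                        f"document:\n\n{chunk}\n\n"
                        "Provide a brief context that situates this chunk within "
                        "the overall document to improve search retrieval. "
                        "Respond only with the concise context and nothing else."
                    ),
                },
            ],
        }],
    )
    return f"Context: {response.content[0].text.strip()}\n\nChunk: {chunk}"
```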
. . .
On my mind
Contextual retrieval represents a simple but powerful improvement over traditional RAG systems. By enriching each chunk with a contextual text, we pin down its semantic place within its source document, dramatically reducing the ambiguity of each chunk and thus improving the quality of the information passed to the LLM. Combined with hybrid search, this technique allows us to preserve semantics, keywords, and context all at the same time.
Did you like this post? Let's be friends! Join me at:
📰 Substack 💌 Medium 💼 LinkedIn ☕ Buy me a coffee!
All photos by the author, unless otherwise noted.


