Hybrid Search and Re-Ranking in Production RAG

We got a complaint from a platform engineer on the infrastructure team that our internal knowledge assistant was confidently giving wrong answers when asked about the retry policy for our message-queue consumers. It looked like a simple question with a well-documented answer. I was wrong.
The system returned a three-paragraph response about exponential backoff with jitter. All of it was accurate, but none of it was what she asked for. The document she actually wanted described a custom override we had put in place for one specific service after a production incident six months earlier. That document used the phrase “dead-letter queue threshold” repeatedly. Our embedding model had decided that “exponential backoff” and “dead-letter queue threshold” were semantically close enough.
I checked the retrieval logs and found that the document she needed was sitting at position eleven in the results, just outside the top ten that were passed to the model. The system had not failed to index or retrieve it; it had simply ranked it below ten other documents.
In This Article
- Why dense retrieval alone is not enough
- BM25: what it is, why it still matters, and what it gets wrong
- Hybrid search: combining the two
- Cross-encoders: what they are and why they work
- Implementing re-ranking
- Measuring the impact
- Metadata filtering
- Putting it all together
- One final note on RAGAS
The Problem With Dense Vectors
This is part 3 of The RAG for Enterprise series. If you have missed the earlier parts, I strongly recommend checking them out first: A practical guide to RAG for Enterprise Knowledge base and Your Chunks Failed Your RAG in Production.
Dense retrieval works by converting text into high-dimensional vectors and finding the chunks whose vectors are geometrically close to the query vector. If two pieces of text mean contextually similar things, their vectors should end up nearby in embedding space.
This holds up well for conceptual queries such as “What is our incident escalation process?” That query will retrieve incident-related documents even if a document uses words like “severity triage” instead of “escalation.” The embedding model has learned that these concepts are related, and cosine similarity captures that.
The problem arises when specific technical language comes into play. An engineer searching for “dead-letter queue threshold configuration” is not asking a conceptual question. She wants the exact term from the exact document, but the embedding representation of “dead-letter queue threshold” has been averaged out by everything else in the surrounding paragraphs into a single dense vector. This averaging is a trade-off. The same property that makes dense retrieval good at conceptual matching makes it unreliable for exact-term lookup.
Bi-encoders, the models that power dense retrieval, compress the meaning of an entire chunk into one fixed-size vector. This compression loses information. The question is not whether information gets lost but which information we can afford to lose.
BM25: What It Is and How It Helps
Before neural retrieval existed, search was dominated by term-frequency methods. BM25, or Best Match 25, is the best known of them and is still widely used in Elasticsearch, Solr, Weaviate, and most production search systems.
BM25 scores a document against a query using three main components:
The IDF (inverse document frequency) component asks how rare a term is across the corpus. A term that appears in the majority of documents tells us almost nothing about relevance. “Dead-letter queue threshold” appearing in only a few documents is a strong signal. IDF gives rare terms more weight.
The term frequency component asks how often the term appears in this specific document. Here BM25 gets clever: instead of using the raw frequency, it applies a saturation function, so the score grows quickly for the first few occurrences and then flattens. The tenth occurrence of a term adds far less than the first.
The length normalisation component penalises longer documents. A longer document naturally contains more term occurrences, and without normalisation BM25 would systematically favour the longest documents, which is not what we want.
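To make these components concrete, here is a minimal, illustrative sketch of the Okapi BM25 scoring function in pure Python. This is for intuition only; production implementations such as Lucene's differ in details like IDF smoothing.
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a query. corpus is a list of
    tokenised documents; k1 controls TF saturation, b controls length
    normalisation."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        # IDF: terms that are rare across the corpus get more weight
        n = sum(1 for d in corpus if term in d)
        idf = math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1)
        # TF with saturation: the contribution flattens as occurrences grow
        tf = doc_terms.count(term)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * (tf * (k1 + 1)) / denom
    return score

corpus = [
    "the retry policy uses exponential backoff with jitter".split(),
    "dead-letter queue threshold configuration for the payment consumer".split(),
]
print(bm25_score("dead-letter queue threshold".split(), corpus[1], corpus))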
BM25 cannot match synonyms, handle paraphrases, or understand that “configuration override” and “custom settings” are related. It is a bag-of-words model, so word order and semantics do not matter to it. The phrases “the configuration overrides the default retry behaviour” and “the default retry behaviour can be overridden via configuration” look essentially the same to BM25, and that is both its strength and its limitation.
Hybrid Search: Combining Both
Weaviate supports hybrid search natively, combining BM25 keyword scores and dense vector similarity scores into a single ranked list through a method called Relative Score Fusion. The key parameter is alpha which controls the blend. An alpha of 1 is pure vector search and an alpha of 0 is pure BM25. Everything between is a weighted combination.
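Conceptually, relative score fusion normalises each ranker's scores into the same range and then takes the weighted blend. Here is a rough sketch of the idea; Weaviate's internal implementation may differ in details:
def relative_score_fusion(bm25_scores: dict, vector_scores: dict, alpha: float) -> dict:
    """Blend two score maps (doc_id -> raw score) into one fused score map."""
    def normalise(scores: dict) -> dict:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero
        return {doc: (s - lo) / span for doc, s in scores.items()}
    bm25_n, vec_n = normalise(bm25_scores), normalise(vector_scores)
    # alpha = 1 -> pure vector search, alpha = 0 -> pure BM25
    return {
        doc: alpha * vec_n.get(doc, 0.0) + (1 - alpha) * bm25_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(vec_n)
    }
Enabling hybrid mode through LlamaIndex on top of a Weaviate index looks like this: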
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

# Alpha of 0.5 = equal weight to keyword and semantic signals
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
    vector_store_query_mode="hybrid",
    alpha=0.5,
    vector_store_kwargs={
        "filters": MetadataFilters(filters=[
            MetadataFilter(key="department", value="engineering")
        ])
    }
)
Wiring this up is the easy part. Deciding what alpha value to use is harder.
The question is how precise your typical query is. If most of your queries are conceptual (“how does our incident process work?”), higher alpha tilts you toward semantic matching. If queries are often exact-term lookups (“GDPR Article 17 checklist”, “retry policy DLQ threshold”, “Service X SLA”), lower alpha gives more weight to keyword matching.
In practice, you want to measure it instead of just guessing. Here is how I tuned alpha on our corpus using a labelled evaluation set of 150 query-document pairs drawn from our IT helpdesk history.
import numpy as np

def evaluate_alpha(alpha_value, eval_queries, ground_truth_docs):
    # Update the retriever with the alpha value under test
    retriever.alpha = alpha_value
    results = []
    for query, expected_doc_ids in zip(eval_queries, ground_truth_docs):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [n.node.metadata.get("doc_id") for n in retrieved_nodes]
        # Hit: any expected document appears anywhere in the retrieved set
        hit = any(doc_id in retrieved_ids for doc_id in expected_doc_ids)
        # Rank of the first expected document, 1-indexed (None if missed)
        rank = next(
            (i + 1 for i, doc_id in enumerate(retrieved_ids) if doc_id in expected_doc_ids),
            None
        )
        results.append({"hit": hit, "rank": rank})
    hit_rate = np.mean([r["hit"] for r in results])
    mrr = np.mean([1 / r["rank"] if r["rank"] else 0 for r in results])
    return {"alpha": alpha_value, "hit_rate": hit_rate, "mrr": mrr}

# Test across the range
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
results = [evaluate_alpha(a, eval_queries, ground_truth_docs) for a in alphas]
for r in results:
    print(f"Alpha: {r['alpha']:.2f} | Hit Rate: {r['hit_rate']:.3f} | MRR: {r['mrr']:.3f}")
On our engineering corpus, the results looked like this (your numbers will differ):
Alpha: 0.00 | Hit Rate: 0.71 | MRR: 0.58 # Pure BM25
Alpha: 0.25 | Hit Rate: 0.80 | MRR: 0.66
Alpha: 0.50 | Hit Rate: 0.83 | MRR: 0.69 -> our sweet spot
Alpha: 0.75 | Hit Rate: 0.81 | MRR: 0.67
Alpha: 1.00 | Hit Rate: 0.73 | MRR: 0.61 # Pure dense
The pure modes were clearly worse than any blend. The dead-letter queue query that had failed before moved from rank eleven to rank four at 0.5 alpha, because the BM25 signal pulled it up even though the dense signal was still ambivalent.
An important caveat on these numbers: if your corpus is mostly long-form narrative documentation, you might find that an alpha of 0.65 to 0.75 works better. If it contains a lot of exact technical identifiers (error codes, product names), 0.35 to 0.5 will likely serve you better. There is no universal correct value. Measure on your own data.
Note: If your hit rate at alpha=0.0 (pure BM25) is substantially lower than at alpha=1.0 (pure dense), your corpus is vocabulary-rich with heavy paraphrasing, and you should lean towards a higher alpha. If the reverse is true, your users search with precise technical terms, and you should lean lower. If the two are similar, start at 0.5 and tune from there.
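Using the results list from the alpha sweep above, that decision rule is easy to automate. The 0.05 gap threshold here is just an illustrative choice:
bm25_hr = results[0]["hit_rate"]    # alpha = 0.0, pure BM25
dense_hr = results[-1]["hit_rate"]  # alpha = 1.0, pure dense
if dense_hr - bm25_hr > 0.05:
    print("Paraphrase-heavy corpus: start with alpha around 0.6-0.7")
elif bm25_hr - dense_hr > 0.05:
    print("Exact-term corpus: start with alpha around 0.35-0.5")
else:
    print("Balanced corpus: start at alpha 0.5 and tune from there")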
The Problem That Hybrid Search Does Not Solve
When you pass ten retrieved chunks to an LLM, it reads all of them, but its attention is not uniform across the context window. Multiple studies, most famously of the “lost in the middle” problem, have shown that models pay more attention to context near the beginning and end of their input and are less reliable with information buried in the middle. If the most relevant chunk is sitting at position eight of ten, you are relying on the model to fish it out of a relatively inattentive zone.
Cross-Encoders: What They Are and Why They Work
A bi-encoder processes the query and the document completely independently. By the time it computes similarity, neither of the two finalised vectors knows anything about the other.
A cross-encoder does something different. It takes the query and a single document, concatenates them into one input sequence, and runs them through the transformer together. Every attention layer in the model can now let query tokens attend to document tokens and vice versa. The model sees the full interaction between the two before producing its relevance score.
The difference in what the model can detect is substantial. Consider the query “What is the retry limit for the payment service?” and two candidate chunks. Chunk A: “Retry limits vary by service type. For most internal services, the default is three attempts with exponential backoff.” Chunk B: “The payment service consumer is configured with a maximum of five retry attempts before the message is routed to the dead-letter queue.”
A bi-encoder might rank Chunk A higher because “retry limit” and “retry limits vary by service type” are semantically close. A cross-encoder reading both texts together immediately notices that Chunk B contains the actual number for the specific service the query asks about. The cross-attention between “payment service” in the query and “payment service consumer” in the document gives it direct evidence that Chunk B is the right answer.
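You can see the difference concretely with a few lines of sentence-transformers. The bi-encoder model name here is an illustrative choice, and the exact scores will vary by model:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "What is the retry limit for the payment service?"
chunk_a = ("Retry limits vary by service type. For most internal services, "
           "the default is three attempts with exponential backoff.")
chunk_b = ("The payment service consumer is configured with a maximum of five "
           "retry attempts before the message is routed to the dead-letter queue.")

# Bi-encoder: embed each text independently, then compare with cosine similarity
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, a_emb, b_emb = bi_encoder.encode([query, chunk_a, chunk_b])
print("bi-encoder    A:", util.cos_sim(q_emb, a_emb).item(),
      "| B:", util.cos_sim(q_emb, b_emb).item())

# Cross-encoder: query and chunk attend to each other in a single forward pass
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, chunk_a), (query, chunk_b)])
print("cross-encoder A:", scores[0], "| B:", scores[1])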
This is why cross-encoders are substantially more accurate than bi-encoders on ranking tasks. The limitation is that they cannot pre-compute anything. A bi-encoder can pre-embed all your documents at index time, so at query time it only needs to embed the query and run a nearest-neighbour lookup. A cross-encoder must process every (query, document) pair at query time. For a million documents, that is a million forward passes. You cannot run a cross-encoder over your entire corpus.
The solution is the two-stage funnel: use the bi-encoder to cast a wide net and retrieve the top N candidates quickly, then use the cross-encoder to re-score only those N candidates precisely.
In practice, the cross-encoder adds roughly 80 – 120 ms to query latency when re-ranking 20 documents with a lightweight model on CPU.
Implementing Re-ranking
We use ms-marco-MiniLM-L-6-v2 from the sentence-transformers library. It was trained on MS MARCO, a large-scale passage-ranking dataset built from real Bing queries, and it is the most widely used open-source cross-encoder for general retrieval tasks. For domain-specific content you can fine-tune on your own labelled pairs, but the general model is a reasonable starting point.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_nodes(query: str, retrieved_nodes: list, top_n: int = 5) -> list:
    """
    Takes a query and a list of LlamaIndex NodeWithScore objects,
    returns the top_n nodes re-ranked by cross-encoder score.
    """
    # Build (query, chunk_text) pairs for the cross-encoder
    pairs = [(query, node.node.get_content()) for node in retrieved_nodes]
    # predict() scores all pairs and returns a list of floats
    scores = reranker.predict(pairs)
    # Attach scores to nodes and sort descending
    for node, score in zip(retrieved_nodes, scores):
        node.score = float(score)
    reranked = sorted(retrieved_nodes, key=lambda n: n.score, reverse=True)
    return reranked[:top_n]

# Full retrieval + re-ranking pipeline
query = "What is the retry limit for the payment service dead-letter queue?"

# Stage 1: retrieve more than you need (20 candidates)
retrieved = retriever.retrieve(query)  # top_k=20 in retriever config

# Stage 2: re-rank down to 5
reranked = rerank_nodes(query, retrieved, top_n=5)

# Inspect what happened to document ranks
print("After re-ranking:")
for i, node in enumerate(reranked):
    source = node.node.metadata.get("source", "unknown")
    print(f"  Rank {i+1} | Score: {node.score:.4f} | Source: {source}")
LlamaIndex also has a native SentenceTransformerRerank post-processor that integrates into its query pipeline cleanly:
from llama_index.postprocessor.sbert_rerank import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine

reranker_postprocessor = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker_postprocessor]
)

response = query_engine.query(
    "What is the retry limit for the payment service dead-letter queue?"
)
top_n=5 here tells the re-ranker how many documents to pass on to the generation step. Increasing it gives the LLM more context but increases both latency and the risk of noise. In our system, 5 was the sweet spot: enough context for multi-part questions without cluttering the prompt.
Note: Log the rank correlation between retrieval order and re-ranking order across a sample of queries. If the cross-encoder is rarely changing the ranking, either your bi-encoder is already doing a good job and re-ranking adds little value, or your cross-encoder model is too generic for your domain.
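Here is a sketch of that check, using SciPy's Spearman correlation and the rerank_nodes helper from above; it assumes your nodes expose stable IDs:
from scipy.stats import spearmanr

def rank_agreement(query: str) -> float:
    """Spearman correlation between the retriever's order and the
    cross-encoder's order. Values near 1.0 mean re-ranking is a no-op."""
    retrieved = retriever.retrieve(query)
    original_order = [n.node.node_id for n in retrieved]
    reranked = rerank_nodes(query, list(retrieved), top_n=len(retrieved))
    new_positions = [original_order.index(n.node.node_id) for n in reranked]
    rho, _ = spearmanr(range(len(new_positions)), new_positions)
    return rho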
Measuring the Impact
We compared three configurations: pure dense retrieval (our baseline), hybrid search, and hybrid search plus re-ranking. I ran our evaluation set of 150 queries through all three.
from ragas import evaluate
from ragas.metrics import (
    ContextPrecision,
    ContextRecall,
    AnswerRelevancy,
    Faithfulness
)
from datasets import Dataset

def build_ragas_dataset(queries, retrieved_contexts, ground_truths, generated_answers):
    return Dataset.from_dict({
        "question": queries,
        "contexts": retrieved_contexts,  # list of lists of strings
        "answer": generated_answers,
        "ground_truth": ground_truths
    })

# Build datasets for each configuration, then evaluate
baseline_dataset = build_ragas_dataset(
    queries, baseline_contexts, ground_truths, baseline_answers
)
hybrid_dataset = build_ragas_dataset(
    queries, hybrid_contexts, ground_truths, hybrid_answers
)
hybrid_rerank_dataset = build_ragas_dataset(
    queries, hybrid_rerank_contexts, ground_truths, hybrid_rerank_answers
)

metrics = [ContextPrecision(), ContextRecall(), AnswerRelevancy(), Faithfulness()]
baseline_result = evaluate(baseline_dataset, metrics=metrics)
hybrid_result = evaluate(hybrid_dataset, metrics=metrics)
hybrid_rerank_result = evaluate(hybrid_rerank_dataset, metrics=metrics)
The results on our engineering corpus (these are real numbers from our internal evaluation):
Configuration | Context Precision | Context Recall | Answer Relevancy | Faithfulness
----------------------------|-------------------|----------------|------------------|-------------
Dense only (alpha=1.0) | 0.61 | 0.74 | 0.78 | 0.82
Hybrid (alpha=0.5) | 0.71 | 0.83 | 0.81 | 0.85
Hybrid + Re-ranking (top 5) | 0.79 | 0.84 | 0.87 | 0.89
A few things worth noting:
Context Recall improved substantially from dense to hybrid (0.74 to 0.83), then barely moved with re-ranking (0.84). This is expected: recall measures whether the right document was retrieved at all, and re-ranking works within the already-retrieved set, so it cannot help recall. The hybrid improvement came from the BM25 component pulling in exact-term matches that the dense model had ranked too low.
Context Precision jumped significantly with re-ranking (0.71 to 0.79). Precision measures what proportion of the retrieved chunks are actually relevant. Re-ranking is doing exactly what it should, pushing the irrelevant material out of the top 5 that gets passed to the generation step.
Answer Relevancy and Faithfulness both improved at each stage. These are end-to-end metrics and they reflect the cumulative benefit of better retrieval flowing through to better generation.
Metadata Filtering
Metadata filtering lets you narrow the retrieval space before you run vector search. If a user is in the engineering department asking about deployment processes, you can restrict the search to engineering documents before the embedding comparison even starts. The result is not just faster retrieval but a smaller and more relevant candidate pool that makes both BM25 and dense scoring more accurate.
from datetime import datetime, timedelta
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
    FilterCondition
)

# Build a retriever whose search space is narrowed by the user's context
def build_retriever_with_filters(
    index,
    user_department: str,
    max_doc_age_days: int = 365
):
    cutoff_date = (datetime.now() - timedelta(days=max_doc_age_days)).isoformat()
    filters = MetadataFilters(
        filters=[
            MetadataFilter(
                key="department",
                value=user_department,
                operator=FilterOperator.EQ
            ),
            MetadataFilter(
                key="updated_at",
                value=cutoff_date,
                operator=FilterOperator.GT
            ),
            MetadataFilter(
                key="classification",
                value="confidential",
                operator=FilterOperator.NE  # Exclude confidential unless authorised
            ),
        ],
        condition=FilterCondition.AND
    )
    return VectorIndexRetriever(
        index=index,
        similarity_top_k=20,
        vector_store_query_mode="hybrid",
        alpha=0.5,
        vector_store_kwargs={"filters": filters}
    )
A runbook for a service that was decommissioned eighteen months ago is not just unhelpful; it is dangerous if the system surfaces it as a confident answer to a question about current infrastructure. Filtering by updated_at prevents that stale information from surfacing at all.
One failure mode to be aware of: if your filter is too narrow and excludes the document that actually contains the answer, you will get a wrong answer delivered confidently from the remaining documents. The right approach is to start with sensible defaults (department filter, maybe a date filter) and add stricter filters only for documented use cases where you have verified they help on real queries.
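One simple mitigation, sketched here with hypothetical retriever names, is to widen the search when a filtered query comes back nearly empty and let the re-ranker handle the extra noise:
def retrieve_with_fallback(query: str, filtered_retriever, unfiltered_retriever,
                           min_results: int = 3) -> list:
    """If the filter may have excluded the answer, fall back to a wider search."""
    nodes = filtered_retriever.retrieve(query)
    if len(nodes) < min_results:
        # Too few candidates: the filter is probably too narrow for this query
        nodes = unfiltered_retriever.retrieve(query)
    return nodes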
The Complete Pipeline
Here is how all three components fit together in a complete retrieval flow (condensed for demonstration):
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.sbert_rerank import SentenceTransformerRerank
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.llms.ollama import Ollama

# LLM (local, via Ollama)
llm = Ollama(model="llama3", request_timeout=120.0)

# Stage 1: Hybrid retriever with metadata filters
retriever = build_retriever_with_filters(
    index=index,
    user_department="engineering",
    max_doc_age_days=365
)

# Stage 2: Cross-encoder re-ranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5
)

# Stage 3: Response synthesizer
synthesizer = get_response_synthesizer(
    llm=llm,
    response_mode="compact",  # Merges multiple chunks into one prompt
    use_async=True
)

# Assemble the query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_synthesizer=synthesizer
)

# Query it
response = query_engine.query(
    "What is the retry limit for the payment service dead-letter queue?"
)
print(response.response)

# Source attribution: important for enterprise use cases
for node in response.source_nodes:
    print(f"  Source: {node.node.metadata.get('source')} | Score: {node.score:.4f}")
One thing to note is the response_mode="compact" setting: this mode merges multiple retrieved chunks into a single prompt call rather than making one LLM call per chunk. For five chunks it reduces latency substantially and keeps context-window usage manageable. If you are using a model with a smaller context limit, or your chunks are long, response_mode="tree_summarize" is an alternative that summarises hierarchically in stages.
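Switching modes is a one-line change to the synthesizer:
synthesizer = get_response_synthesizer(
    llm=llm,
    response_mode="tree_summarize",  # Hierarchical, staged merging of chunks
    use_async=True
)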
After these changes shipped to production, the internal knowledge assistant answered the original question about the retry policy for our message-queue consumers correctly.
One Final Note on RAGAS
A RAGAS score is not a product quality metric. It is a diagnostic instrument. A Context Precision of 0.79 tells you that, on average, 79% of what you are passing to the model is relevant. It does not tell you that 79% of your users are getting correct answers.
Where RAGAS is genuinely useful is in measuring the impact of changes. When you introduce hybrid search, does Context Recall go up? When you add re-ranking, does Context Precision improve without destroying recall? These are questions it can answer reliably. Use it as a before/after instrument every time you change something in the retrieval pipeline, and track the numbers over time as your corpus evolves.
Where This Leaves the Series
In the first article, we built the indexing pipeline: ingesting documents from Confluence and local directories with LlamaIndex, chunking them, embedding with BGE-large, and storing in Weaviate. In the second, we went deep on chunking strategies and learned that the shape of the chunks determines what the retrieval system can and cannot find.
In this article, we have addressed the retrieval quality problem directly. Hybrid search gave us meaningful recall improvements on exact-term queries. Cross-encoder re-ranking improved the precision of what we pass to the LLM. Metadata filtering kept stale and irrelevant documents out of the candidate pool before the expensive scoring steps even ran. The RAGAS numbers showed cumulative improvement at each stage.
Stay tuned for the next article in the series, where we will tackle another problem that challenges engineers building production RAG systems.



