RAG Explained: Reranking for Better Answers

In my previous post, we went over how a RAG pipeline works. In RAG, the relevant documents are identified and retrieved from the knowledge base based on how similar they are to the user's question. More specifically, each text chunk is scored against the question with a similarity metric, such as cosine similarity or dot product, and the top-scoring chunks, being the most similar to the user's question, are retrieved.
Unfortunately, high similarity scores do not always guarantee relevance. In other words, retrieval can return a text chunk that scores high on similarity but is actually not helpful, and not what we need to answer the user's question 🤷🏻‍♀️. And that's where reranking comes in, as a means of refining the retrieved results before feeding them to the LLM.
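As a quick refresher, cosine similarity between two embedding vectors can be computed in a couple of lines. Here is a minimal sketch with NumPy, using tiny made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot product of the two vectors,
    # divided by the product of their magnitudes
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = [1.0, 0.0, 1.0]   # toy "question" embedding
chunk_vecs = [
    [0.9, 0.1, 0.8],          # similar direction -> score close to 1
    [0.0, 1.0, 0.0],          # orthogonal -> score 0
]
scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
print(scores)
```

A score near 1 means the two vectors point in nearly the same direction, which is exactly what retrieval uses as a proxy for "these texts are about the same thing".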
As in my previous post, I will use the text of 'War and Peace' as an example; it is in the public domain and readily available through Project Gutenberg.
• • • •
What about Reranking?
Chunks retrieved solely on the basis of a similarity metric, that is, plain vector retrieval, may not work so well for several different reasons:
- The retrieved chunks depend strongly on the chosen number of top chunks. Depending on the number k of top chunks we retrieve, we can get very different results.
- We can get back chunks that look similar to what we asked for, yet do not actually contain the information needed to answer the user's question.
- We can get high similarity simply because some words of the user's question appear in a chunk, which pulls in chunks that include those specific words but are not actually relevant.
Back to my favorite question from 'War and Peace': for example, if we ask 'Who is Anna Pávlovna?' and use a small k (say k = 2), the retrieved chunks may not contain enough information to answer the question in full. On the other hand, if we allow a larger number k of chunks to be retrieved (say k = 20), we will likely bring back some unfit chunks where 'Anna Pávlovna' is merely mentioned in passing but is not what the chunk is really about. As a result, some of those retrieved chunks will be similar to the user's question and yet irrelevant to it. Therefore, we need a way to single out the text chunks that are truly relevant from everything we retrieved.
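To make the effect of k concrete, here is a toy sketch with invented chunks and scores (not real retrieval output), showing how weaker matches sneak in as k grows:

```python
# toy (chunk, cosine score) pairs, already sorted by score -- invented numbers
ranked_chunks = [
    ("Anna Pávlovna is introduced as a hostess ...", 0.82),
    ("Anna Pávlovna's soirée is described ...", 0.79),
    ("a chunk that merely mentions her name in passing", 0.71),
    ("a battle scene with lexical overlap only", 0.65),
]

def top_k(ranked, k):
    # retrieval keeps only the k highest-scoring chunks
    return [text for text, _ in ranked[:k]]

print(top_k(ranked_chunks, 2))  # only the strongest matches
print(top_k(ranked_chunks, 4))  # weaker, possibly irrelevant chunks included
```

With k = 2 only the strongest matches survive; with k = 4 the marginal chunks come along too, and nothing in the cosine score alone tells us which of them are actually useful.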
Here, it is worth pointing out that one obvious solution to this issue would be to simply retrieve everything and pass it all to the generation step (to the LLM). Unfortunately, this is not feasible for a bunch of reasons, such as the fact that LLMs have a limited context window, or that LLM performance degrades depending on where in the context the relevant information appears.
So this is the problem we try to tackle by introducing a reranking step. In essence, reranking means re-evaluating the retrieved chunks with a method that is more accurate than the plain similarity scores, but also much more expensive.
There are various ways to do this: for example, using models called cross-encoders, using an LLM to perform the reranking, or applying heuristics. Ultimately, by adding this extra reranking step, we end up with what is called two-stage retrieval with reranking, which is a common pattern in industry. This improves the relevance of the retrieved text and, as a result, the quality of the generated responses.
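Schematically, two-stage retrieval with reranking can be sketched as follows; `ToyIndex` and `overlap_score` are made-up stand-ins for a real vector store and a real reranker, used only to show the shape of the pipeline:

```python
class ToyIndex:
    """Stand-in for a vector store; a real one would return the k nearest chunks."""
    def __init__(self, chunks):
        self.chunks = chunks
    def search(self, query, k):
        return self.chunks[:k]

def overlap_score(query, doc):
    # toy "reranker": shared-word count (a real one would be a cross-encoder)
    return len(set(query.lower().split()) & set(doc.lower().split()))

def two_stage_retrieval(query, index, rerank_fn, k_retrieve=3, k_final=1):
    candidates = index.search(query, k=k_retrieve)                  # stage 1: cheap, wide net
    scored = [(doc, rerank_fn(query, doc)) for doc in candidates]   # stage 2: accurate scoring
    scored.sort(key=lambda pair: pair[1], reverse=True)             # best reranker score first
    return [doc for doc, _ in scored[:k_final]]

toy_index = ToyIndex([
    "Anna Pávlovna hosts a soirée",
    "the army marches on Moscow",
    "who is Anna Pávlovna really",
])
print(two_stage_retrieval("who is Anna Pávlovna", toy_index, overlap_score))
```

The expensive scorer only ever sees the `k_retrieve` candidates, never the whole knowledge base; that is the entire trick.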
So, let's take a look at this in more detail…
• • • •
Reranking with Cross-Encoder
Cross-encoders are models commonly used for reranking in RAG frameworks. Unlike the retrieval step used in the first stage, which simply compares independently computed embeddings, a cross-encoder can perform a deep comparison of two texts. More specifically, a cross-encoder takes the user's question and a document together as a single input and produces a relevance score. On the flip side, in cosine-similarity retrieval, the document and the user's question are embedded separately. As a result, some of the information in the original texts is lost when each one is embedded in isolation. This is why a cross-encoder can assess the relevance between two texts (i.e., the user's question and a document) much better.
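The difference in call patterns is easy to sketch. In the toy code below, both "models" are made-up stand-ins (a letter-frequency "embedding" and a word-overlap "scorer"); the point is only the interface: the bi-encoder never sees the two texts together, while the cross-encoder scores the pair jointly:

```python
import numpy as np

def toy_embed(text):
    # stand-in "embedding": letter-frequency vector (a real bi-encoder uses a transformer)
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def bi_encoder_score(query, doc):
    # bi-encoder style: each text is embedded separately, then compared
    return float(np.dot(toy_embed(query), toy_embed(doc)))

def cross_encoder_score(query, doc):
    # cross-encoder style: stand-in for model.predict([(query, doc)]) --
    # the model sees both texts at once and can match tokens across them
    shared = set(query.lower().split()) & set(doc.lower().split())
    return len(shared) / max(len(query.split()), 1)

q = "who is Anna Pávlovna"
d = "Anna Pávlovna Schérer was maid of honour to the Empress"
print(bi_encoder_score(q, d), cross_encoder_score(q, d))
```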
So why not use a cross-encoder in the first place? The answer is that cross-encoders are dramatically slower. For example, a plain cosine-similarity search over 1,000 passages takes less than a millisecond. On the contrary, using a cross-encoder (such as ms-marco-MiniLM-L-6-v2) to score the same collection of 1,000 passages against a single question can be orders of magnitude slower!
This is expected when you think about it, because using a cross-encoder means that we have to run a full forward pass of the model for every (question, document) pair, and do this again for every new question. On the contrary, with cosine-similarity retrieval, we compute the embeddings of the entire knowledge base in advance, once, and at query time we only embed the user's question and calculate the cosine similarities.
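We can sketch this cost asymmetry by counting model forward passes instead of seconds (the numbers below are illustrative, not benchmarks):

```python
# a sketch of the cost asymmetry, counting "model calls" instead of seconds
N_DOCS = 1_000
N_QUERIES = 50

# bi-encoder: embed every document once, plus one embedding per query;
# the similarity computation itself is just fast vector math
bi_encoder_calls = N_DOCS + N_QUERIES

# cross-encoder: every (query, document) pair needs its own forward pass
cross_encoder_calls = N_QUERIES * N_DOCS

print(bi_encoder_calls)     # grows additively
print(cross_encoder_calls)  # grows multiplicatively
```

The bi-encoder cost grows additively with documents and queries, while the cross-encoder cost grows with their product, which is exactly why we only unleash it on a handful of pre-filtered chunks.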
For this reason, we modify our RAG pipeline accordingly and get the best of both worlds: as a first step, we narrow down the relevant chunks with the cheap cosine-similarity search, and, as a second step, we rerank those few chunks with the more accurate cross-encoder.
• • • •
Back to the “War and Peace” Example
So now let's see how all this plays out in the 'War and Peace' example, by once again answering my favorite question: 'Who is Anna Pávlovna?'.
My code so far looks something like this:
import os
import numpy as np
import faiss
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

api_key = "my_api_key"

# initialize LLM
llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)

# initialize embeddings model
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# loading documents to be used for RAG
text_folder = "RAG files"
documents = []
for filename in os.listdir(text_folder):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

# split documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        split_docs.append(Document(page_content=chunk))
documents = split_docs

# normalize knowledge base embeddings so that inner product == cosine similarity
def normalize(vectors):
    vectors = np.array(vectors, dtype="float32")  # FAISS expects float32
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

doc_texts = [doc.page_content for doc in documents]
doc_embeddings = embeddings.embed_documents(doc_texts)
doc_embeddings = normalize(doc_embeddings)

# faiss index with inner product
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # inner product index
index.add(doc_embeddings)

# create vector database w FAISS
vector_store = FAISS(embedding_function=embeddings, index=index, docstore=None, index_to_docstore_id=None)
vector_store.docstore = {i: doc for i, doc in enumerate(documents)}

def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # embed + normalize query
        query_embedding = embeddings.embed_query(user_input)
        query_embedding = normalize([query_embedding])

        # search FAISS index
        D, I = index.search(query_embedding, k=2)

        # get relevant documents
        relevant_docs = [vector_store.docstore[i] for i in I[0]]
        retrieved_context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # D contains inner product scores == cosine similarities (since normalized)
        print("\nTop chunks and their cosine similarity scores:\n")
        for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
            print(f"Chunk {rank}:")
            print(f"Cosine similarity: {score:.4f}")
            print(f"Content:\n{vector_store.docstore[idx].page_content}\n{'-'*40}")

        # system prompt
        system_prompt = (
            "You are a helpful assistant. "
            "Use ONLY the following knowledge base context to answer the user. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{retrieved_context}"
        )

        # messages for LLM
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]

        # generate response
        response = llm.invoke(messages)
        assistant_message = response.content.strip()
        print(f"\nAssistant: {assistant_message}\n")

if __name__ == "__main__":
    main()
For k = 2, we get the following top retrieved chunks:

But if we instead set k = 6, we get the following retrieved chunks, and they yield a more informative answer, containing additional details relevant to our question, such as the fact that she is 'maid of honour and favourite of the Empress Márya Fëdorovna'.

Now, let's modify our code to rerank those 6 retrieved chunks and see whether the top 2 remain the same. To do this, we will use a cross-encoder model to rerank the top-k retrieved texts before passing them to the LLM. More specifically, I will use the cross-encoder/ms-marco-TinyBERT-L-2 cross-encoder, a lightweight model trained for cross-encoding that runs on top of PyTorch. To do so, we need to import the torch and sentence_transformers libraries.
import torch
from sentence_transformers import CrossEncoder
Then we can initialize the cross-encoder and define a function that reranks the top-k chunks retrieved from the vector search:
# initialize cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2', device='cuda' if torch.cuda.is_available() else 'cpu')

def rerank_with_cross_encoder(query, relevant_docs):
    pairs = [(query, doc.page_content) for doc in relevant_docs]  # (query, document) pairs for the cross-encoder
    scores = cross_encoder.predict(pairs)  # relevance scores from the cross-encoder model
    ranked_indices = np.argsort(scores)[::-1]  # sort documents by cross-encoder score (the higher, the better)
    ranked_docs = [relevant_docs[i] for i in ranked_indices]
    ranked_scores = [scores[i] for i in ranked_indices]
    return ranked_docs, ranked_scores
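As a quick sanity check of the reordering logic itself, without downloading the model, we can run the same argsort-based logic with injected stand-in scores; `FakeDoc` and the score values here are made up for illustration:

```python
import numpy as np

class FakeDoc:
    """Minimal stand-in for a LangChain Document."""
    def __init__(self, page_content):
        self.page_content = page_content

def rerank_with_scores(relevant_docs, scores):
    # same reordering logic as rerank_with_cross_encoder, with scores injected
    ranked_indices = np.argsort(scores)[::-1]  # highest score first
    ranked_docs = [relevant_docs[i] for i in ranked_indices]
    ranked_scores = [scores[i] for i in ranked_indices]
    return ranked_docs, ranked_scores

docs = [FakeDoc("weak match"), FakeDoc("best match"), FakeDoc("medium match")]
fake_scores = np.array([0.1, 0.9, 0.4])  # pretend cross-encoder outputs
ranked_docs, ranked_scores = rerank_with_scores(docs, fake_scores)
print([d.page_content for d in ranked_docs])
```

Whatever the scorer, the documents come back sorted from most to least relevant, which is all the downstream code relies on.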
… and modify the retrieval step in the main loop accordingly:
...
# search FAISS index
D, I = index.search(query_embedding, k=6)

# get relevant documents
relevant_docs = [vector_store.docstore[i] for i in I[0]]

# rerank with our function
reranked_docs, reranked_scores = rerank_with_cross_encoder(user_input, relevant_docs)

# keep only the top reranked chunks as context
retrieved_context = "\n\n".join([doc.page_content for doc in reranked_docs[:2]])

# D contains inner product scores == cosine similarities (since normalized)
print("\nTop 6 Retrieved Chunks:\n")
for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
    print(f"Chunk {rank}:")
    print(f"Similarity: {score:.4f}")
    print(f"Content:\n{vector_store.docstore[idx].page_content}\n{'-'*40}")

# display top reranked chunks
print("\nTop 2 Re-ranked Chunks:\n")
for rank, (doc, score) in enumerate(zip(reranked_docs[:2], reranked_scores[:2]), start=1):
    print(f"Rank {rank}:")
    print(f"Reranker Score: {score:.4f}")
    print(f"Content:\n{doc.page_content}\n{'-'*40}")
...
… and, finally, these are the top 2 chunks, and the respective answer we receive, after reranking with the cross-encoder:

Notice how different the top 2 reranked chunks are from the top 2 chunks we initially received from the vector search.
This showcases the value of the reranking step well. We use the vector search to narrow down the relevant chunks out of all the documents available in the knowledge base, and then use the reranking step to identify the most relevant chunks among them.

We can picture the two steps as a funnel: the first stage pulls in a broad set of candidate chunks, and the reranking stage then narrows it down to the best ones. The resulting context is much more useful, leading to clearer and more accurate answers.
• • • •
On my mind
All in all, reranking turns out to be a crucial step for building a powerful RAG pipeline. Essentially, it allows us to bridge the gap between fast but less precise vector search and accurate, context-aware relevance scoring. By performing two-stage retrieval, with vector search as the first step and reranking as the second, we get the best of both worlds: efficient performance with high-quality answers. In fact, this two-stage approach is what makes many of today's RAG pipelines work so well.
• • • •
Did you like this post? Let's be friends! Join me on:
📰 Substack 📝 Medium 💼 LinkedIn ☕ Buy me a coffee!
• • • •
What about pialgorithms?
Want to bring the power of RAG into your organization?
pialgorithms can do it for you 👉 book a demo today!



