Machine Learning

A Hitchhiker's Guide to RAG: From Small Files to Tolstoy with the OpenAI API and LangChain

In my previous post, I walked you through setting up a very simple RAG pipeline in Python, using the OpenAI API, LangChain, and your local files. There, I covered the basics of creating a vector store from your local files with LangChain and the FAISS vector database, making the API calls to OpenAI, and ultimately producing appropriate answers based on your files. 🌟

Image by the author

However, in that simple example, I only showed how to use a tiny .txt file. In this post, I demonstrate how to use large files in your RAG pipeline by adding one extra step to the process: chunking.

What about chunking?

Chunking is the process of splitting text into small pieces of text, chunks, which are then converted into embeddings. This is essential because it allows us to successfully process large files and create embeddings for them. All embedding models come with various limitations on the size of the input text (more on these limitations in a minute), and respecting those limits allows for better performance and lower-latency responses. If the text we provide does not respect these limits, it will be truncated or rejected.

If we wanted to create a RAG pipeline that reads, say, the text of Leo Tolstoy's War and Peace (a very large book), we could not load it directly and convert it into a single embedding. Instead, we need to do the chunking first: create small chunks of text and produce an embedding for each one, with every chunk staying under the size limitations of whichever embedding model we use to convert text into vectors. Therefore, a more realistic view of the RAG pipeline looks as follows:

Image by the author

There are several parameters we can tune in the chunking process to adapt it to our needs. The main parameter is the chunk size, which allows us to specify how large each chunk will be (in characters or tokens). The trick here is that chunks should be small enough to be processed within the embedding model's size limits, but at the same time, they should also be large enough to contain coherent, meaningful information.

For example, let's imagine that we want to process the following sentence from War and Peace, where Prince Andrew thinks about the war:

Image by the author

Let's assume we have created the following (rather small) chunks:

Image by the author

Then, if we were to ask something like "What does Prince Andrew mean by 'all the same now'?", we might not get a good answer, because a chunk such as "'But is it not all the same now?' thought he." carries no context on its own and is unclear. On the contrary, the meaning is scattered across several chunks. As a result, whatever chunks are retrieved for our question may be of no use for producing a correct answer. Therefore, selecting the right chunk size during the chunking process, in accordance with the nature of the documents in the RAG's knowledge base, can have a profound effect on the quality of the answers we receive. As a rule of thumb, a chunk's content should make sense to a human reading it without any surrounding context, so that it can also make sense to the model. Ultimately, chunk size is a trade-off: chunks need to be small enough to meet the model's limitations, but large enough to preserve meaning.
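
To see this trade-off in practice, here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter (which we will also use later in this post) on a short stand-in passage. The passage and the chunk sizes are illustrative assumptions, not the exact setup shown in the images above.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# A short stand-in passage, in the spirit of the War and Peace example above
passage = (
    '"But is it not all the same now?" thought he. '
    '"And what will be there, and what has been here? '
    'Why was I so reluctant to part with life?"'
)

# Very small chunks break each phrase apart; a larger size keeps the thought together
for size in (30, 200):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=0)
    chunks = splitter.split_text(passage)
    print(f"chunk_size={size} -> {len(chunks)} chunk(s)")
    for chunk in chunks:
        print("   ", repr(chunk))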

• • • •

Another important chunking parameter is the chunk overlap. That is, how much we want consecutive chunks to overlap with each other. For example, in the War and Peace example, we would get something similar to the following chunks if we chose an overlap of 5 characters.

Image by the author

This is also an important decision to make because:

  • A larger overlap means more tokens and more calls used to create the embeddings, which means higher cost
  • A smaller overlap means a higher chance of losing relevant information at the chunk boundaries

Selecting a suitable chunk overlap depends largely on the type of text we want to process. For example, a set of records where the language is simple and straightforward probably won't need any exotic chunking approach. On the flip side, a classic novel such as War and Peace, where the language is complex and the different chapters and sections are heavily interconnected, will probably need a more careful chunking setup for the RAG to produce coherent results.
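
As a small illustration of the overlap parameter, the sketch below splits an illustrative sentence (a paraphrase, not a verbatim quote) with and without a character overlap, so you can see the repeated text at the chunk boundaries. The sentence and parameter values are assumptions for demonstration only.

from langchain.text_splitter import RecursiveCharacterTextSplitter

sentence = (
    "Prince Andrew lay on the field and gazed at the lofty sky, "
    "thinking how quiet and solemn it was compared with the battle."
)

for overlap in (0, 15):
    splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=overlap)
    print(f"chunk_overlap={overlap}:")
    for chunk in splitter.split_text(sentence):
        print("   ", repr(chunk))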

• • • •

But what if all we need is a simple RAG looking up a few documents that already fit within whatever size limits apply to a single chunk? Do we still need the chunking step, or can we just create a single embedding per document? The short answer is that it is still best to chunk, even if the knowledge base fits within the size restrictions. That is because, when we embed entire documents at once, meaning gets diluted: relevant information buried inside a large document is poorly represented by a single large embedding, which hurts retrieval.

What are these 'mysterious' limits?

Typically, a request to the embedding model can include one or more chunks of text. There are several types of limits we should consider regarding the size of the text sent for embedding, and each of them takes different values depending on the embedding model we use. More specifically, these are:

  • Chunk size, also referred to as maximum tokens per input, or context window. This is the maximum size of each chunk in tokens. For example, for OpenAI's text-embedding-3-small model, the chunk size limit is 8,191 tokens. If we provide a chunk larger than the chunk size limit, in many cases it will be silently truncated!
  • Number of chunks per request, also referred to as the number of inputs. There is also a limit on how many chunks can be included in each request. For example, all OpenAI embedding models have a limit of 2,048 inputs, that is, at most 2,048 chunks per request.
  • Number of tokens per request: There is also a limit on the total number of tokens across all chunks in a request. For all OpenAI embedding models, the maximum total number of tokens across all chunks in a single request is 300,000 tokens. (A small sketch for checking chunks against these limits follows below.)
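
To make these limits concrete, here is a rough sketch of how one might count tokens with OpenAI's tiktoken library and check a list of chunks against them. The limit values are the ones quoted above, and tiktoken (pip install tiktoken) is an extra assumption on top of the libraries used later in this post.

import tiktoken

# cl100k_base is the encoding used by OpenAI's recent embedding models
encoding = tiktoken.get_encoding("cl100k_base")

MAX_TOKENS_PER_CHUNK = 8_191      # max tokens per input (text-embedding-3-small)
MAX_CHUNKS_PER_REQUEST = 2_048    # max number of inputs per request
MAX_TOKENS_PER_REQUEST = 300_000  # max total tokens per request

def check_chunks(chunks: list[str]) -> None:
    token_counts = [len(encoding.encode(chunk)) for chunk in chunks]
    print(f"{len(chunks)} chunks, {sum(token_counts)} tokens in total")
    for i, count in enumerate(token_counts):
        if count > MAX_TOKENS_PER_CHUNK:
            print(f"chunk {i} has {count} tokens and exceeds the per-chunk limit")
    if len(chunks) > MAX_CHUNKS_PER_REQUEST or sum(token_counts) > MAX_TOKENS_PER_REQUEST:
        print("these chunks cannot all go into a single embeddings request")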

So, what happens when our documents add up to more than 300,000 tokens? As you may have guessed, the answer is that we make multiple consecutive requests of 300,000 or fewer tokens each. Many Python libraries do this behind the scenes. For example, LangChain's OpenAIEmbeddings, which I use in my previous post, automatically creates batches that stay under 300,000 tokens, provided the documents have already been split into chunks.
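
LangChain handles this batching for us, but to make the idea concrete, here is a rough, hypothetical sketch of the logic: accumulate chunks until adding one more would push the running token count over the 300,000-token limit, then start a new batch. This is an illustration of the principle, not LangChain's actual implementation.

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS_PER_REQUEST = 300_000

def batch_chunks(chunks: list[str]) -> list[list[str]]:
    """Group chunks into batches that each stay under the per-request token limit."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for chunk in chunks:
        n_tokens = len(encoding.encode(chunk))
        if current and current_tokens + n_tokens > MAX_TOKENS_PER_REQUEST:
            batches.append(current)  # close off the current batch
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# each batch would then be sent as a separate embeddings request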

Chunking for large files in the RAG pipeline

Let's see how all of this plays out in a simple Python example, using the War and Peace text as the knowledge base for the RAG. The data I use, the text of Leo Tolstoy's War and Peace, is licensed as public domain and can be obtained from Project Gutenberg.

So, first, let's try reading from the War and Peace text without setting up any chunking. For this tutorial, you will need the langchain, openai, and faiss Python libraries. We can easily install the required packages as follows:

pip install openai langchain langchain-community langchain-openai faiss-cpu

After confirming that the required libraries are installed, our simple RAG code looks like this and works fine with a small, simple .txt file in the text_folder.

import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader

# OpenAI API key
api_key = "your key"

# initialize LLM
llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)

# loading documents to be used for RAG 
text_folder =  "RAG files"  

documents = []
for filename in os.listdir(text_folder):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

# generate embeddings
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# create vector database w FAISS 
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()


def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.n")
    
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # get relevant documents
        relevant_docs = retriever.invoke(user_input)
        retrieved_context = "nn".join([doc.page_content for doc in relevant_docs])

        # system prompt
        system_prompt = (
            "You are a helpful assistant. "
            "Use ONLY the following knowledge base context to answer the user. "
            "If the answer is not in the context, say you don't know.nn"
            f"Context:n{retrieved_context}"
        )

        # messages for LLM 
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]

        # generate response
        response = llm.invoke(messages)
        assistant_message = response.content.strip()
        print(f"nAssistant: {assistant_message}n")

if __name__ == "__main__":
    main()

But if I add the War and Peace .txt file to the same folder and try running the exact same code, I get the following error:

Image by the author

Ughh 🙃

So what happened here? LangChain's OpenAIEmbeddings cannot split the text into separate batches of fewer than 300,000 tokens, because we did not provide it with chunks. It sent the text as a single, unsplit chunk of 777,181 tokens, resulting in a request that exceeds the 300,000 tokens-per-request limit.

• • • •

Now, let's set up the chunking process to create multiple embeddings from this large file. To do this, I will be using the text_splitter functionality provided by LangChain, and more specifically, RecursiveCharacterTextSplitter. In RecursiveCharacterTextSplitter, the chunk size and chunk overlap parameters are specified as a number of characters, but other splitters, such as TokenTextSplitter or OpenAITokenSplitter, allow setting these parameters as a number of tokens.

Therefore, we can set up a text splitter instance as below:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

… and use it to split our original documents into chunks …

from langchain_core.documents import Document

split_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        split_docs.append(Document(page_content=chunk))

… then use those chunks to create the embeddings …

documents = split_docs

# create embeddings + FAISS index
embeddings = OpenAIEmbeddings(openai_api_key=api_key)
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()

.....

… and voila 🌟

Now our code can handle the provided document without complaint, even though it is very large, and gives appropriate answers.

Image by the author

On my mind

Choosing a chunking setup that fits the size and complexity of the documents we want to feed into our RAG pipeline is crucial for the quality of the answers we will receive. Certainly, there are several other parameters and more sophisticated chunking approaches one can explore. Nonetheless, setting a sensible chunk size and overlap is the foundation for building RAGs that produce coherent results.

• • • •

Did you like this post? Do you have an exciting data or AI project in mind?

Let's be friends! Join me on:

📰 Substack 📝 Medium 💼 LinkedIn ☕ Buy me a coffee!

• • • •
