
7 Steps to Building a Simple RAG Pipeline from Scratch


Getting Started

These days, almost everyone uses ChatGPT, Gemini, or another large language model (LLM). They make life easier, but they still get things wrong. For example, I remember asking a model who won the most recent US election and getting the name of the previous President. It sounded convincing, but the model's training data simply ended before the election took place. This is where Retrieval-Augmented Generation (RAG) helps LLMs provide accurate and timely responses. Instead of relying solely on the model's internal knowledge, RAG pulls information from external sources, such as PDFs, documents, or APIs, and uses it to produce more reliable answers. In this guide, I'll walk you through seven practical steps to build a simple RAG pipeline from scratch.

Understanding How RAG Works

Before we get to the code, here's an overview of the architecture. A RAG system has two main parts: a retriever and a generator. The retriever searches your knowledge base and extracts the most relevant chunks of text. The generator is a language model that takes those snippets and turns them into a natural, useful answer. The process is straightforward:

  1. The user asks a question.
  2. The retriever searches your documents or database and fetches the best-matching chunks.
  3. Those chunks are passed to the LLM as context.
  4. The LLM then generates a response grounded in that retrieved context.

Now let's break that flow down into seven easy steps and build it end to end.
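Before building the real thing, the flow above can be sketched in a few lines of dependency-free Python. This is only an illustration: keyword overlap stands in for the vector-search retriever, and string formatting stands in for the LLM.

```python
# Toy sketch of the retrieve-then-generate flow (illustration only).

def retrieve(question, documents, top_k=1):
    """Score each document by word overlap with the question, best first."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(question, context_chunks):
    """Stand-in for an LLM call: just format context + question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\nQuestion: {question}"

docs = [
    "Supervised learning uses labeled data.",
    "Unsupervised learning finds structure in unlabeled data.",
]
answer = generate("What is supervised learning?",
                  retrieve("What is supervised learning?", docs))
print(answer)
```

A real pipeline replaces `retrieve` with embedding-based similarity search (Steps 3 and 4) and `generate` with an actual model call (Step 6).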

Step 1: Prepare the Data

Although large language models already know a lot from books and web data, they have no access to your private or newly produced information, such as research notes, company documents, or project files. RAG lets you feed your own data to the model, reducing hallucinations and making responses more accurate and timely. For this article, we'll keep things simple and use a few short text files about machine learning concepts.

data/
 ├── supervised_learning.txt
 └── unsupervised_learning.txt
supervised_learning.txt:
In this type of machine learning (supervised), the model is trained on labeled data. 
In simple terms, every training example has an input and an associated output label. 
The objective is to build a model that generalizes well on unseen data. 
Common algorithms include:
- Linear Regression
- Decision Trees
- Random Forests
- Support Vector Machines

Classification and regression tasks are performed in supervised machine learning.
For example: spam detection (classification) and house price prediction (regression).
They can be evaluated using accuracy, F1-score, precision, recall, or mean squared error.
unsupervised_learning.txt:
In this type of machine learning (unsupervised), the model is trained on unlabeled data. 
Popular algorithms include:
- K-Means
- Principal Component Analysis (PCA)
- Autoencoders

There are no predefined output labels; the algorithm automatically detects 
underlying patterns or structures within the data.
Typical use cases include anomaly detection, customer clustering, 
and dimensionality reduction.
Performance can be measured qualitatively or with metrics such as silhouette score 
and reconstruction error.

The next task is to load this data. For that, we will create a Python file, load_data.py:

import os

def load_documents(folder_path):
    docs = []
    for file in os.listdir(folder_path):
        if file.endswith(".txt"):
            with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
                docs.append(f.read())
    return docs

Before we use the data, we will clean it. If the text is messy, the model can retrieve irrelevant or incorrect passages, increasing hallucinations. Now let's create another Python file, clean_data.py:

import re

def clean_text(text: str) -> str:
    # Collapse whitespace runs, drop non-ASCII characters, and trim
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return text.strip()
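A quick sanity check of the cleaner (a self-contained copy for illustration; note the `\s` and `\x00-\x7F` escapes, which the regexes need in order to match whitespace and non-ASCII characters correctly):

```python
import re

def clean_text(text: str) -> str:
    # Collapse whitespace runs, drop non-ASCII characters, and trim
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return text.strip()

print(clean_text("  Machine\n\nlearning\tbasics  "))  # → "Machine learning basics"
```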

Finally, combine everything into a new file called prepare_data.py to load and clean your documents together:

from load_data import load_documents
from clean_data import clean_text

def prepare_docs(folder_path="data/"):
    """
    Loads and cleans all text documents from the given folder.
    """
    # Load Documents
    raw_docs = load_documents(folder_path)

    # Clean Documents
    cleaned_docs = [clean_text(doc) for doc in raw_docs]

    print(f"Prepared {len(cleaned_docs)} documents.")
    return cleaned_docs

Step 2: Split the Text into Chunks

LLMs have a limited context window, i.e., they can only process a limited amount of text at once. We solve this by splitting long documents into shorter, more compact chunks (typically 300 to 500 words each). We will use LangChain's RecursiveCharacterTextSplitter, which splits text at natural boundaries such as paragraphs and sentences. Each chunk stays coherent, so the model can quickly find the right piece while answering.

split_text.py

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=500, chunk_overlap=100):
 
   # define the splitter
   splitter = RecursiveCharacterTextSplitter(
       chunk_size=chunk_size,
       chunk_overlap=chunk_overlap
   )

   # use the splitter to split docs into chunks
   chunks = splitter.create_documents(documents)
   print(f"Total chunks created: {len(chunks)}")

   return chunks

Chunking helps the model work with the text without losing its meaning. If we don't add a little overlap between chunks, the model can get confused at the boundaries, and information that spans two chunks might be lost.
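The idea behind overlap can be illustrated with a dependency-free sliding-window splitter. This is only a sketch: LangChain's splitter is smarter about splitting at sentence and paragraph boundaries.

```python
def split_with_overlap(text, chunk_size=20, chunk_overlap=5):
    """Slide a fixed-size window over the text; each chunk repeats
    the last `chunk_overlap` characters of the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap(
    "Supervised learning uses labeled data to train models.",
    chunk_size=20, chunk_overlap=5,
)
for c in chunks:
    print(repr(c))
```

Notice that the tail of each chunk reappears at the head of the next one, so a sentence cut at a chunk boundary is still seen whole in one of the two chunks.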

Step 3: Create and Store Vector Embeddings

A computer does not understand raw text; it only understands numbers. Therefore, we need to convert our text chunks into numbers. These numbers are called vector embeddings, and they capture the meaning behind the text. We can use tools from OpenAI, Sentence Transformers, or Hugging Face for this. Let's create a new file called create_embeddings.py and use Sentence Transformers to generate the embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np

def get_embeddings(text_chunks):
  
   # Load embedding model
   model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
  
   print(f"Creating embeddings for {len(text_chunks)} chunks:")
   embeddings = model.encode(text_chunks, show_progress_bar=True)
  
   print(f"Embeddings shape: {embeddings.shape}")
   return np.array(embeddings)
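"Close to each other in vector space" is usually measured with cosine similarity. Here is a numpy-only illustration with made-up 3-D vectors (real embeddings from all-MiniLM-L6-v2 have 384 dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction; near 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for three words
cat    = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car    = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(cat, kitten))  # high: related meanings
print(cosine_similarity(cat, car))     # low: unrelated meanings
```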

Each embedding captures the semantic meaning of its chunk: chunks with similar meanings end up close to each other in vector space. Now we will store the embeddings in a vector database such as FAISS (Facebook AI Similarity Search), Chroma, or Pinecone, which makes similarity search fast. Here we'll use FAISS, a lightweight, local option. You can install it with pip install faiss-cpu.

Next, let's create a file called store_faiss.py. First, we make the necessary imports:

import faiss
import numpy as np
import pickle

Now we will create a FAISS index from our embeddings using a function build_faiss_index().

def build_faiss_index(embeddings, save_path="faiss_index"):
   """
   Builds FAISS index and saves it.
   """
   dim = embeddings.shape[1]
   print(f"Building FAISS index with dimension: {dim}")

   # Use a simple flat L2 index
   index = faiss.IndexFlatL2(dim)
   index.add(embeddings.astype('float32'))

   # Save FAISS index
   faiss.write_index(index, f"{save_path}.index")
   print(f"Saved FAISS index to {save_path}.index")

   return index
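Under the hood, a flat L2 index simply computes the squared distance from the query to every stored vector and returns the closest ones. A numpy-only sketch of that search (illustration only; FAISS does the same thing in optimized native code):

```python
import numpy as np

def flat_l2_search(query_vec, stored_vecs, top_k=2):
    """Brute-force nearest-neighbor search, like faiss.IndexFlatL2."""
    dists = np.sum((stored_vecs - query_vec) ** 2, axis=1)
    idx = np.argsort(dists)[:top_k]
    return idx, dists[idx]

# Three toy 2-D "embeddings" and a query near the second one
stored = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], dtype="float32")
query = np.array([0.9, 1.1], dtype="float32")

indices, distances = flat_l2_search(query, stored, top_k=2)
print(indices)  # nearest stored vectors come first
```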

Each embedding represents a chunk of text, and the index lets FAISS find the nearest ones the next time a user asks a question. Finally, we need to save the text chunks (the metadata) into a pickle file so they can be reloaded later during retrieval.

def save_metadata(text_chunks, path="faiss_metadata.pkl"):
   """
   Saves the mapping of vector positions to text chunks.
   """
   with open(path, "wb") as f:
       pickle.dump(text_chunks, f)
   print(f"Saved text metadata to {path}")

Step 4: Retrieve Relevant Information

In this step, the user's question is first converted into a numerical form, just as we did with the text chunks earlier. The computer then compares the chunk vectors with the query vector to find the closest ones. This process is called similarity search.
Let's create a new file called retrieve_faiss.py and add the imports we need:

import faiss
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

Now, create a function to load the saved FAISS index from disk for searching.

def load_faiss_index(index_path="faiss_index.index"):
    """
    Loads the saved FAISS index from disk.
    """
    print("Loading FAISS index.")
    return faiss.read_index(index_path)

We will also need another function that loads the metadata, which contains the text chunks we saved earlier.

def load_metadata(metadata_path="faiss_metadata.pkl"):
    """
    Loads text chunk metadata (the actual text pieces).
    """
    print("Loading text metadata.")
    with open(metadata_path, "rb") as f:
        return pickle.load(f)

The original text chunks are stored in a metadata file (faiss_metadata.pkl) and are used to map FAISS results back to readable text. At this point, we'll create another function that takes the user's query, embeds it, and retrieves the corresponding chunks from the FAISS index. This is where the semantic search happens.

def retrieve_similar_chunks(query, index, text_chunks, top_k=3):
    """
    Retrieves top_k most relevant chunks for a given query.
  
    Parameters:
        query (str): The user's input question.
        index (faiss.Index): FAISS index object.
        text_chunks (list): Original text chunks.
        top_k (int): Number of top results to return.
  
    Returns:
        list: Top matching text chunks.
    """
  
    # Embed the query
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    # Ensure query vector is float32 as required by FAISS
    query_vector = model.encode([query]).astype('float32')
  
    # Search FAISS for nearest vectors
    distances, indices = index.search(query_vector, top_k)
  
    print(f"Retrieved top {top_k} similar chunks.")
    return [text_chunks[i] for i in indices[0]]

This gives you high quality chunks of text to use as context.

Step 5: Assemble the Retrieved Context

Once we have the most relevant chunks, the next step is to combine them into a single context block. This context is then prepended to the user's query before passing everything to the LLM. This step ensures the model has all the information needed to produce accurate, grounded answers. You can combine chunks like this:

context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=3)
context = "\n\n".join(context_chunks)

This combined context will be used later when we build the final prompt for the LLM.
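Joining the chunks and slotting them into a prompt template can be previewed without loading any model. This sketch uses hypothetical chunks and mirrors the template shape used in Step 6:

```python
# Hypothetical retrieved chunks, standing in for retrieve_similar_chunks() output
context_chunks = [
    "Supervised learning trains on labeled data.",
    "Common algorithms include linear regression and decision trees.",
]
query = "What is supervised learning?"

# Join chunks with blank lines, then build the final prompt
context = "\n\n".join(context_chunks)
prompt = f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"
print(prompt)
```

Ending the prompt with "Answer:" nudges the model to continue with the answer itself rather than restating the question.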

Step 6: Generate the Answer with an LLM

Now, we combine the retrieved context with the user query and feed it to the LLM to generate the final answer. Here we will use a freely available open-source model from Hugging Face, but you can use any model you like.

Let's create a new file called generate_answer.py then add the import:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from retrieve_faiss import load_faiss_index, load_metadata, retrieve_similar_chunks

Now define the function generate_answer() to run the complete process:

def generate_answer(query, top_k=3):
    """
    Retrieves relevant chunks and generates a final answer.
    """
    # Load FAISS index and metadata
    index = load_faiss_index()
    text_chunks = load_metadata()

    # Retrieve top relevant chunks
    context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=top_k)
    context = "\n\n".join(context_chunks)

    # Load open-source LLM
    print("Loading LLM...")
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    # Load tokenizer and model, using a device map for efficient loading
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    # Build the prompt
    prompt = f"""
    Context:
    {context}
    Question:
    {query}
    Answer:
    """

    # Generate output
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Use the correct input for model generation
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
    
    # Decode and clean up the answer, removing the original prompt
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Simple way to remove the prompt part from the output
    answer = full_text.split("Answer:")[1].strip() if "Answer:" in full_text else full_text.strip()
    
    print("\nFinal Answer:")
    print(answer)

Step 7: Run the Full RAG Pipeline

This last step brings everything together. We will build a main.py file that runs the entire workflow, from loading the data to producing the final answer.

# Data preparation
from prepare_data import prepare_docs
from split_text import split_docs

# Embedding and storage
from create_embeddings import get_embeddings
from store_faiss import build_faiss_index, save_metadata

# Retrieval and answer generation
from generate_answer import generate_answer

Now define the main function:

def run_pipeline():
    """
    Runs the full end-to-end RAG workflow.
    """
    print("\nLoad and Clean Data:")
    documents = prepare_docs("data/")
    print(f"Loaded {len(documents)} clean documents.\n")

    print("Split Text into Chunks:")
    # split_docs takes our list of strings and returns LangChain Document objects
    chunks = split_docs(documents, chunk_size=500, chunk_overlap=100)

    # Extract the raw text content from each Document
    texts = [c.page_content for c in chunks]
    print(f"Created {len(texts)} text chunks.\n")

    print("Generate Embeddings:")
    embeddings = get_embeddings(texts)
  
    print("Store Embeddings in FAISS:")
    index = build_faiss_index(embeddings)
    save_metadata(texts)
    print("Stored embeddings and metadata successfully.\n")

    print("Retrieve & Generate Answer:")
    query = "Does unsupervised ML cover regression tasks?"
    generate_answer(query)

Finally, run through the pipeline:

if __name__ == "__main__":
    run_pipeline()

Output:

[Screenshot of the pipeline output | Photo by the Author]

Wrapping Up

RAG bridges the gap between what an LLM already knows and the ever-changing details of the world. I used a very basic pipeline here so you can understand how RAG works. At the enterprise level, many advanced techniques are layered on top, such as guardrails, hybrid search, reranking, and context-processing techniques. If you're interested in exploring higher-level concepts, here are my personal favorites:

Kanwal Mehreen is a machine learning engineer and technical writer with a strong interest in data science and the intersection of AI and medicine. She authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM.
