You Probably Don't Need a Vector Database for Your RAG – Yet

Thanks to Retrieval Augmented Generation (RAG), vector databases are getting a lot of attention in the world of AI.

Many people say you need tools like Pinecone, Weaviate, Milvus, or Qdrant to build a RAG system and manage your embeddings. If you're working on enterprise applications with hundreds of millions of vectors, then tools like these are essential. They let you perform CRUD operations, filter by metadata, and use a disk-based index that isn't limited by your machine's memory.

But for many internal tools, document bots, or MVP agents, adding a dedicated vector database can be overkill. It adds complexity, network latency, and integration costs, and makes things harder to manage.

The truth is that "vector search" (i.e., the retrieval part of RAG) is just matrix multiplication. And Python already has some of the best tools in the world for that.

In this article, we'll show you how to build a production-friendly retrieval stage for a RAG pipeline, for small to medium data volumes, using only NumPy and SciKit-Learn. You will see that it is possible to search millions of characters of text in milliseconds, all in memory and with no external services.

Understanding Retrieval as Matrix Math

In general, RAG involves four main steps:

  1. Embed: Convert your source text into vectors (arrays of floating-point numbers).
  2. Store: Persist those vectors in a database.
  3. Retrieve: Find stored vectors that are mathematically "close" to the query vector.
  4. Generate: Feed the corresponding text to the LLM and receive your final answer.

Steps 1 and 4 depend on large language models. Steps 2 and 3 are the territory of the vector DB. We will focus on steps 2 and 3 and how to avoid using a vector DB altogether.

But when we search our vector store, what exactly does "close" mean? Usually, it is cosine similarity. If your two vectors are normalized to magnitude 1, then their cosine similarity is simply their dot product.

If you have a query vector Q of shape 1×N and a database of document vectors D of shape M×N, finding the best match is not a database query; it is a matrix multiplication: the dot product of D with the transpose of Q.

Scores = D.Q^T

NumPy is designed to do exactly this kind of work efficiently, using routines that leverage modern CPU features such as vectorization.
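To make this concrete, here is a toy sketch of retrieval as a single matrix multiplication. The 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import numpy as np

# Hypothetical "database" of 4 document vectors, dimension 3
D = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.6, 0.8, 0.0],
              [0.0, 0.0, 1.0]])
# Query vector Q with shape (1, 3), already unit-length
q = np.array([[0.6, 0.8, 0.0]])

# Normalize the document rows so the dot product equals cosine similarity
D = D / np.linalg.norm(D, axis=1, keepdims=True)

scores = D @ q.T            # Scores = D · Q^T, shape (4, 1)
best = int(np.argmax(scores))
print(best)                 # → 2 (the row identical to the query)
```

One matrix multiply scores every document at once; `argmax` (or `argsort` for top-k) picks the winners.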

Implementation

We will create a class called SimpleVectorStore that manages ingestion, indexing, and retrieval. Our input data will be one or more files containing the text we want to search. Using Sentence Transformers for local embedding will make everything work offline.

What is required

Set up a new development environment, install the required libraries, and start a Jupyter notebook.

Type the following commands in the command shell. I use uv as my package manager; adjust to suit whatever tool you're using.

$ uv init ragdb
$ cd ragdb
$ uv venv ragdb
$ source ragdb/bin/activate
$ uv pip install numpy scikit-learn sentence-transformers jupyter
$ jupyter notebook

In-Memory Vector Store

We don't need a complicated server. What we need is a function to load our text data from the input files and split it into bite-sized chunks, plus a class holding two arrays: one for the raw text chunks and one for the embedding matrix. Here is the code.

import numpy as np
import os
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any
from pathlib import Path

class SimpleVectorStore:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {model_name}...")
        self.encoder = SentenceTransformer(model_name)
        self.documents = []  # Stores the raw text and metadata
        self.embeddings = None # Will become a numpy array 

    def add_documents(self, docs: List[Dict[str, Any]]):
        """
        Ingests documents.
        docs format: [{'text': '...', 'metadata': {...}}, ...]
        """
        texts = [d['text'] for d in docs]
        
        # 1. Generate Embeddings
        print(f"Embedding {len(texts)} documents...")
        new_embeddings = self.encoder.encode(texts)
        
        # 2. Normalize Embeddings 
        # (Critical optimization: makes the dot product equal to cosine similarity)
        norm = np.linalg.norm(new_embeddings, axis=1, keepdims=True)
        new_embeddings = new_embeddings / norm
        
        # 3. Update Storage
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
            
        self.documents.extend(docs)
        print(f"Store now contains {len(self.documents)} documents.")

    def search(self, query: str, k: int = 5):
        """
        Retrieves the top-k most similar documents.
        """
        if self.embeddings is None or len(self.documents) == 0:
            print("Warning: Vector store is empty. No documents to search.")
            return []

        # 1. Embed and Normalize Query
        query_vec = self.encoder.encode([query])
        norm = np.linalg.norm(query_vec, axis=1, keepdims=True)
        query_vec = query_vec / norm
        
        # 2. Vectorized Search (Matrix Multiplication)
        # Result shape: (1, N_docs)
        scores = np.dot(self.embeddings, query_vec.T).flatten()
        
        # 3. Get Top-K Indices
        # argsort sorts ascending, so we take the last k and reverse them
        # Ensure k doesn't exceed the number of documents
        k = min(k, len(self.documents))
        top_k_indices = np.argsort(scores)[-k:][::-1]
        
        results = []
        for idx in top_k_indices:
            results.append({
                "score": float(scores[idx]),
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
            
        return results

def load_from_directory(directory_path: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Reads .txt files and splits them into overlapping chunks.
    """
    docs = []
    # Use pathlib for robust path handling and resolution
    path = Path(directory_path).resolve()
    
    if not path.exists():
        print(f"Error: Directory '{path}' not found.")
        print(f"Current working directory: {os.getcwd()}")
        return docs
        
    print(f"Loading documents from: {path}")
    for file_path in path.glob("*.txt"):
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()
                
            # Simple sliding window chunking
            # We iterate through the text with a step size smaller than the chunk size
            # to create overlap (preserving context between chunks).
            step = chunk_size - overlap
            for i in range(0, len(text), step):
                chunk = text[i : i + chunk_size]
                
                # Skip chunks that are too small (e.g., leftover whitespace)
                if len(chunk) < 50:
                    continue
                    
                docs.append({
                    "text": chunk,
                    "metadata": {
                        "source": file_path.name,
                        "chunk_index": i
                    }
                })
        except Exception as e:
            print(f"Warning: Could not read file {file_path.name}: {e}")
            
    print(f"Successfully loaded {len(docs)} chunks from {len(list(path.glob('*.txt')))} files.")
    return docs

The embedding model used

The all-MiniLM-L6-v2 model used in the code comes from the Sentence Transformers library. It was chosen because:

  1. It is fast and lightweight.
  2. It generates 384-dimensional vectors, which use less memory than those of larger models.
  3. It performs well on a variety of English-language tasks without special fine-tuning.

This model is just a suggestion; feel free to substitute any embedding model you prefer.

Why Normalize?

You may have noticed the normalization steps in the code. We mentioned this earlier, but to be explicit: given two vectors X and Y, cosine similarity is defined as

Similarity = (X · Y) / (||X|| * ||Y||)

Where:

  • X · Y is the dot product of vectors X and Y
  • ||X|| is the magnitude (length) of vector X
  • ||Y|| is the magnitude of vector Y

Since the division costs extra work, if all our vectors have unit magnitude the denominator is 1 and the formula reduces to the dot product of X and Y, making the search faster.
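You can verify this equivalence directly. The snippet below uses random toy vectors (made up for illustration) and compares scikit-learn's cosine_similarity against a plain dot product on pre-normalized copies:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 384))   # 5 toy "document" vectors
q = rng.normal(size=(1, 384))   # 1 toy "query" vector

# Full cosine similarity (normalizes internally)
full = cosine_similarity(X, q).ravel()

# Normalize once up front; then scoring is just a dot product
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
qn = q / np.linalg.norm(q, axis=1, keepdims=True)
fast = (Xn @ qn.T).ravel()

print(np.allclose(full, fast))  # → True
```

In a real store you normalize each document vector once at ingestion time, then every subsequent query pays only for the dot product.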

Performance Evaluation

The first thing we need to do is find some input data to work with. You can use any text file for this. For a previous RAG experiment, I used a book downloaded from Project Gutenberg. Rather interestingly:

"Diseases of cattle, sheep, goats, and pigs" by Jno. A. W. Dollar and G. Moussu

Note that you can view Project Gutenberg's Permissions, Licensing and other Common Requests page via the following link.

But to summarize, most of Project Gutenberg's eBooks are in the public domain in the US and other parts of the world. This means that no one can grant or withhold permission to do with them as you wish.

"… as you wish" includes any commercial use, republication in any format, and the making of derivative works or performances.

I downloaded the text of the book from the Project Gutenberg website to my local PC using this link.

The book contains about 36,000 lines of text, and querying it takes only six lines of code. In my sample query, line 2315 of the book discusses a condition called CONDYLOMATA. Here is the excerpt:

INFLAMMATION OF THE INTERDIGITAL SPACE.

(CONDYLOMATA.)

Condylomata result from chronic inflammation of the skin covering the
interdigital ligament. Any injury to this region causing even
superficial damage may result in chronic inflammation of the skin and
hypertrophy of the papillæ, the first stage in the production of
condylomata.

Injuries produced by cords slipped into the interdigital space for the
purpose of lifting the feet when shoeing working oxen are also fruitful
causes.
So we will ask it, "What is Condylomata?" Note that we won't get a final answer, since we don't feed the search result to an LLM, but we should see our search return a snippet that would give an LLM all the information it needs to produce one if we did.

%%time
# 1. Initialize
store = SimpleVectorStore()

# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")

# 3. Add to Store
if real_docs:
   store.add_documents(real_docs)

# 4. Search
results = store.search("What is Condylomata?", k=1)

results

And here is the output.

Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 2205 chunks from 1 files.
Embedding 2205 documents...
Store now contains 2205 documents.
CPU times: user 3.27 s, sys: 377 ms, total: 3.65 s
Wall time: 3.82 s

[{'score': 0.44883957505226135,
  'text': 'two last\nphalanges, the latter operation being easier than 
the former, and\nproviding flaps of more regular shape and better adapted 
for the\nproduction of a satisfactory stump.\n\n\n                
INFLAMMATION OF THE INTERDIGITAL SPACE.\n\n(CONDYLOMATA.)\n\n
Condylomata result from chronic inflammation of the skin covering 
the\ninterdigital ligament. Any injury to this region causing 
even\nsuperficial damage may result in chronic inflammation of the 
skin and\nhypertrophy of the papillæ, the first stage in the production 
of\ncondylomata.\n\nInjuries produced by cords slipped into the 
interdigital space for the\npurpose of lifting the feet when shoeing 
working oxen are also fruitful\ncauses.\n\nInflammation of the 
interdigital space is also a common complication of\naphthous eruptions 
around the claws and in the space between them.\nContinual contact with 
litter, dung and urine favour infection of\nsuperficial or deep wounds, 
and by causing exuberant granulation lead to\nhypertrophy of the papillary 
layer of ',
  'metadata': {'source': 'cattle_disease.txt', 'chunk_index': 122400}}]

Less than 4 seconds to read, chunk, embed, and query a 36,000-line text document is very smooth.

SciKit-Learn: A More Scalable Approach

NumPy works well for brute-force searches. But what if you have hundreds of thousands of chunks, and brute force starts to slow down? Before switching to a vector database, you can try SciKit-Learn's NearestNeighbors. It can use tree-based structures such as KD-Tree and Ball-Tree to cut search to O(log N) instead of O(N), though for high-dimensional data and the cosine metric it typically falls back to brute force.
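Before wiring it into our store, here is a minimal, self-contained sketch of the NearestNeighbors API on toy random vectors (the data here is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical dataset: 100 random unit vectors of dimension 8
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Build the index once...
nn = NearestNeighbors(n_neighbors=3, metric='cosine', algorithm='brute')
nn.fit(X)

# ...then query it. Querying with a stored vector should return
# that same vector as its own nearest neighbour, at distance ~0.
distances, indices = nn.kneighbors(X[:1])
print(indices[0][0])  # → 0
```

The fit-once, query-many pattern is exactly what our store needs: build the index after ingestion, then reuse it for every search.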

To test this, I downloaded a number of other books from Project Gutenberg, including:

  • A Christmas Carol by Charles Dickens
  • The Life and Adventures of Santa Claus by L. Frank Baum
  • War and Peace by Tolstoy
  • A Farewell to Arms by Hemingway

In total, these books contain about 120,000 lines of text. I copied each of the five files ten times, resulting in fifty files and 1.2 million lines of text. That's roughly 12 million words, assuming an average of 10 words per line. For context, this article contains about 2,800 words, so the volume of data we are testing is more than 4,000 times the length of this document.

$ dir

achristmascarol - Copy (2).txt  cattle_disease - Copy (9).txt  santa - Copy (6).txt
achristmascarol - Copy (3).txt  cattle_disease - Copy.txt       santa - Copy (7).txt
achristmascarol - Copy (4).txt  cattle_disease.txt                santa - Copy (8).txt
achristmascarol - Copy (5).txt  farewelltoarms - Copy (2).txt  santa - Copy (9).txt
achristmascarol - Copy (6).txt  farewelltoarms - Copy (3).txt  santa - Copy.txt
achristmascarol - Copy (7).txt  farewelltoarms - Copy (4).txt  santa.txt
achristmascarol - Copy (8).txt  farewelltoarms - Copy (5).txt  warandpeace - Copy (2).txt
achristmascarol - Copy (9).txt  farewelltoarms - Copy (6).txt  warandpeace - Copy (3).txt
achristmascarol - Copy.txt       farewelltoarms - Copy (7).txt  warandpeace - Copy (4).txt
achristmascarol.txt                farewelltoarms - Copy (8).txt  warandpeace - Copy (5).txt
cattle_disease - Copy (2).txt   farewelltoarms - Copy (9).txt  warandpeace - Copy (6).txt
cattle_disease - Copy (3).txt   farewelltoarms - Copy.txt       warandpeace - Copy (7).txt
cattle_disease - Copy (4).txt   farewelltoarms.txt                warandpeace - Copy (8).txt
cattle_disease - Copy (5).txt   santa - Copy (2).txt           warandpeace - Copy (9).txt
cattle_disease - Copy (6).txt   santa - Copy (3).txt           warandpeace - Copy.txt
cattle_disease - Copy (7).txt   santa - Copy (4).txt           warandpeace.txt
cattle_disease - Copy (8).txt   santa - Copy (5).txt

Suppose we are looking for the answer to the following question:

Who, after the Christmas holidays, did Nicholas tell his mother of his love for?

In case you didn't know, this is from the book War and Peace.

Let's see how our new search performs against this wealth of information.

Here is the code that uses SciKit-Learn.

First, we have a new class that uses SciKit-Learn's nearest neighbor algorithm.

from sklearn.neighbors import NearestNeighbors

class ScikitVectorStore(SimpleVectorStore):
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        super().__init__(model_name)
        # Brute force is often faster than trees for high-dimensional data 
        # unless N is very large, but 'ball_tree' can help in specific cases.
        self.knn = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')
        self.is_fit = False

    def build_index(self):
        print("Building Scikit-Learn Index...")
        self.knn.fit(self.embeddings)
        self.is_fit = True

    def search(self, query: str, k: int = 5):
        if not self.is_fit:
            self.build_index()
        
        query_vec = self.encoder.encode([query])
        # Note: with metric='cosine', scikit-learn computes cosine distance
        # directly, so we don't need to normalize the query ourselves.
        
        k = min(k, len(self.documents))
        distances, indices = self.knn.kneighbors(query_vec, n_neighbors=k)
        
        results = []
        for i in range(k):
            idx = indices[0][i]
            # Convert cosine distance back to a similarity score (1 - dist)
            results.append({
                "score": float(1 - distances[0][i]),
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
        return results

And our search code is as simple as the NumPy version.

%%time

# 1. Initialize
store = ScikitVectorStore()

# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")

# 3. Add to Store
if real_docs:
   store.add_documents(real_docs)

# 4. Search
results = store.search("Who, after the Christmas holidays, did Nicholas tell his mother of his love for", k=1)

results

And here is the output.

Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 73060 chunks from 50 files.
Embedding 73060 documents...
Store now contains 73060 documents.
Building Scikit-Learn Index...
CPU times: user 1min 46s, sys: 18.3 s, total: 2min 4s
Wall time: 1min 13s

[{'score': 0.6972659826278687,
  'text': '\nCHAPTER XIII\n\nSoon after the Christmas holidays Nicholas told 
his mother of his love\nfor Sónya and of his firm resolve to marry her. The 
countess, who\nhad long noticed what was going on between them and was 
expecting this\ndeclaration, listened to him in silence and then told her son 
that he\nmight marry whom he pleased, but that neither she nor his father 
would\ngive their blessing to such a marriage. Nicholas, for the first time,
\nfelt that his mother was displeased with him and that, despite her love\n
for him, she would not give way. Coldly, without looking at her son,\nshe 
sent for her husband and, when he came, tried briefly and coldly to\ninform 
him of the facts, in her son's presence, but unable to restrain\nherself she 
burst into tears of vexation and left the room. The old\ncount began 
irresolutely to admonish Nicholas and beg him to abandon his\npurpose. 
Nicholas replied that he could not go back on his word, and his\nfather, 
sighing and evidently disconcerted, very soon became silent ',
  'metadata': {'source': 'warandpeace - Copy (6).txt',
   'chunk_index': 1396000}}]

Almost all of the 1m 13s it takes to perform the above processing is spent loading and embedding our input data. The actual search, when I ran it separately, took less than a tenth of a second!

Not shabby at all.

Summary

I am not arguing that vector databases are unnecessary. They solve problems that NumPy and SciKit-Learn can't. You should move from something like our SimpleVectorStore or ScikitVectorStore to Weaviate/Pinecone/pgvector, etc., if any of the following conditions apply.

Persistence: You need data to survive server restarts without rebuilding the index from the source files every time. (For simple persistence, np.save or pickling works.)
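For example, simple persistence for our store's two arrays might look like this. The file names are made up; np.save handles the embedding matrix and pickle handles the document list.

```python
import numpy as np
import os
import pickle
import tempfile

# Hypothetical store contents to persist
embeddings = np.random.default_rng(1).normal(size=(10, 384)).astype(np.float32)
documents = [{"text": f"chunk {i}", "metadata": {}} for i in range(10)]

# Write both pieces of state to disk
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "embeddings.npy"), embeddings)
with open(os.path.join(tmp, "documents.pkl"), "wb") as f:
    pickle.dump(documents, f)

# ...later, or after a restart, reload without re-embedding anything...
loaded_emb = np.load(os.path.join(tmp, "embeddings.npy"))
with open(os.path.join(tmp, "documents.pkl"), "rb") as f:
    loaded_docs = pickle.load(f)

print(np.array_equal(embeddings, loaded_emb))  # → True
```

This covers cold starts; what it doesn't give you is concurrent writes or incremental durability, which is where a real database earns its keep.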

RAM is a ceiling: Your embedding matrix exceeds your server's memory. (Note: 1 million 384-dimensional float32 vectors take only ~1.5GB of RAM, so you can fit a lot in memory.)
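The arithmetic behind that estimate is simple:

```python
# Back-of-the-envelope memory estimate for the embedding matrix
n_vectors = 1_000_000
dims = 384              # all-MiniLM-L6-v2 output dimension
bytes_per_float32 = 4

total_bytes = n_vectors * dims * bytes_per_float32
print(f"{total_bytes / 1e9:.2f} GB")  # → 1.54 GB
```

Swap in your own corpus size and model dimension to see when you'd actually hit the ceiling.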

Frequent CRUD: You need to constantly update or delete individual vectors while serving reads. NumPy arrays are fixed-size, and appending requires copying the entire array, which is slow.
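You can see the copy-on-append behaviour directly:

```python
import numpy as np

# "Appending" to a NumPy array actually builds a brand-new array.
a = np.ones((3, 4))
b = np.vstack([a, np.zeros((1, 4))])

print(b.shape)                 # → (4, 4)
print(np.shares_memory(a, b))  # → False: the whole matrix was copied
```

For a write-once, read-many store this cost is paid only at ingestion, which is why the approach in this article holds up; for high-churn data it becomes the bottleneck.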

Metadata filtering: You need complex queries like "find vectors near X where user_id=10 AND date > 2023". Doing this in NumPy requires boolean masks, which can get messy.
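A sketch of what that mask looks like, with made-up metadata held in parallel arrays alongside the similarity scores:

```python
import numpy as np

# Hypothetical per-vector metadata stored as parallel arrays
user_ids = np.array([10, 10, 7, 10, 3])
years    = np.array([2022, 2024, 2024, 2025, 2021])
scores   = np.array([0.9, 0.7, 0.95, 0.6, 0.8])  # similarity to query X

# "user_id == 10 AND year > 2023": build a boolean mask, then
# rank only the rows that survive the filter
mask = (user_ids == 10) & (years > 2023)
candidates = np.flatnonzero(mask)                # indices [1, 3]
best = candidates[np.argmax(scores[candidates])]
print(best)  # → 1
```

Workable for one or two fields, but every new filter means another parallel array and another mask term, which is exactly the bookkeeping a vector database does for you.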

Engineering always involves trade-offs. Using a vector database adds complexity to your setup for robustness that you may not currently need. If you start with a direct RAG setup using NumPy and/or SciKit-Learn for the retrieval process, you get:

  • Low Latency. There are no network hops.
  • Low cost. No SaaS subscriptions or additional instances.
  • Simplicity. It's just a Python script.

Just like you don't need a sports car to go to the grocery store, in most cases NumPy or SciKit-Learn is all the RAG search you need.
