You Probably Don't Need a Vector Database for Your RAG – Yet

With the rise of Retrieval Augmented Generation (RAG), vector databases are getting a lot of attention in the world of AI.
Many people say you need tools like Pinecone, Weaviate, Milvus, or Qdrant to build a RAG system and manage your embeddings. If you're working on enterprise applications with hundreds of millions of vectors, then tools like these are essential. They allow you to perform CRUD operations, filter by metadata, and use a disk-based index that bypasses your computer's memory.
But for many internal tools, document bots, or MVP agents, adding a dedicated vector database can be overkill. It adds complexity, network latency, and integration costs, and makes things harder to manage.
The truth is that “vector search” (i.e., the retrieval part of RAG) is just matrix multiplication. And Python already has some of the best tools in the world for that.
In this article, we'll show you how to build a production-friendly retrieval component for a RAG pipeline at small-to-medium scale using only NumPy and SciKit-Learn. You will see that it is possible to search millions of characters of text in milliseconds, all in memory and without external dependencies.
Understanding Retrieval as Matrix Math
In general, RAG involves four main steps:
- Embed: Convert your source text into vectors (arrays of floating point numbers).
- Store: Persist those vectors somewhere you can search them.
- Retrieve: Find the stored vectors that are mathematically “close” to the query vector.
- Generate: Feed the LLM the corresponding text and receive your final answer.
Steps 1 and 4 depend on large language models. Steps 2 and 3 are the territory of the vector DB. We will focus on steps 2 and 3 and how to avoid using a vector DB altogether.
But when we search our vector store, what exactly does “close” mean? Usually, it is cosine similarity. If two vectors are normalized to have magnitude 1, their cosine similarity is simply their dot product.
If you have a query vector Q of shape (1, N) and a matrix of M document vectors D of shape (M, N), finding the best match is not a database query; it is a matrix multiplication: the dot product of D with the transpose of Q.
Scores = D · Q^T
NumPy is designed to do exactly this kind of work efficiently, using routines that take advantage of modern CPU features such as vectorization.
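To make this concrete, here is the entire “search engine” in a few lines of NumPy, using made-up 4-dimensional unit vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

# A toy "database" of 3 document vectors (M=3, N=4), already normalized
D = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])

# A normalized query vector Q of shape (1, N)
Q = np.array([[0.6, 0.8, 0.0, 0.0]])

# Scores = D · Q^T -> one cosine similarity per document
scores = (D @ Q.T).flatten()
best = int(np.argmax(scores))

print(best)  # 2 -> the third document matches the query exactly
```

The third row of D is identical to the query, so its dot product is 1.0, the maximum possible score for unit vectors.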
Implementation
We will create a class called SimpleVectorStore that manages ingestion, indexing, and retrieval. Our input data will be one or more files containing the text we want to search. Using Sentence Transformers for local embeddings means everything works offline.
What you'll need
Set up a new development environment, install the required libraries, and start a Jupyter notebook.
Type the following commands into a shell. I use uv as my package manager; adjust to suit whatever tool you're using.
$ uv init ragdb
$ cd ragdb
$ uv venv ragdb
$ source ragdb/bin/activate
$ uv pip install numpy scikit-learn sentence-transformers jupyter
$ jupyter notebook
In-Memory Vector Store
We don't need a complicated server. What we need is a function that loads our text data from input files and splits it into bite-sized chunks, plus a class holding two structures: a list for the raw text chunks and a NumPy matrix for the embeddings. Here is the code.
import numpy as np
import os
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any
from pathlib import Path


class SimpleVectorStore:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {model_name}...")
        self.encoder = SentenceTransformer(model_name)
        self.documents = []      # Stores the raw text and metadata
        self.embeddings = None   # Will become a numpy array

    def add_documents(self, docs: List[Dict[str, Any]]):
        """
        Ingests documents.
        docs format: [{'text': '...', 'metadata': {...}}, ...]
        """
        texts = [d['text'] for d in docs]

        # 1. Generate Embeddings
        print(f"Embedding {len(texts)} documents...")
        new_embeddings = self.encoder.encode(texts)

        # 2. Normalize Embeddings
        # (Critical optimization: reduces cosine similarity to a plain dot product)
        norm = np.linalg.norm(new_embeddings, axis=1, keepdims=True)
        new_embeddings = new_embeddings / norm

        # 3. Update Storage
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        self.documents.extend(docs)
        print(f"Store now contains {len(self.documents)} documents.")

    def search(self, query: str, k: int = 5):
        """
        Retrieves the top-k most similar documents.
        """
        if self.embeddings is None or len(self.documents) == 0:
            print("Warning: Vector store is empty. No documents to search.")
            return []

        # 1. Embed and Normalize Query
        query_vec = self.encoder.encode([query])
        norm = np.linalg.norm(query_vec, axis=1, keepdims=True)
        query_vec = query_vec / norm

        # 2. Vectorized Search (Matrix Multiplication)
        # Result shape: (1, N_docs)
        scores = np.dot(self.embeddings, query_vec.T).flatten()

        # 3. Get Top-K Indices
        # argsort sorts ascending, so we take the last k and reverse them.
        # Ensure k doesn't exceed the number of documents.
        k = min(k, len(self.documents))
        top_k_indices = np.argsort(scores)[-k:][::-1]

        results = []
        for idx in top_k_indices:
            results.append({
                "score": float(scores[idx]),
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
        return results
def load_from_directory(directory_path: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Reads .txt files and splits them into overlapping chunks.
    """
    docs = []
    # Use pathlib for robust path handling and resolution
    path = Path(directory_path).resolve()
    if not path.exists():
        print(f"Error: Directory '{path}' not found.")
        print(f"Current working directory: {os.getcwd()}")
        return docs

    print(f"Loading documents from: {path}")
    for file_path in path.glob("*.txt"):
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()

            # Simple sliding window chunking
            # We iterate through the text with a step size smaller than the chunk size
            # to create overlap (preserving context between chunks).
            step = chunk_size - overlap
            for i in range(0, len(text), step):
                chunk = text[i : i + chunk_size]
                # Skip chunks that are too small (e.g., leftover whitespace)
                if len(chunk) < 50:
                    continue
                docs.append({
                    "text": chunk,
                    "metadata": {
                        "source": file_path.name,
                        "chunk_index": i
                    }
                })
        except Exception as e:
            print(f"Warning: Could not read file {file_path.name}: {e}")

    print(f"Successfully loaded {len(docs)} chunks from {len(list(path.glob('*.txt')))} files.")
    return docs
The embedding model used
The all-MiniLM-L6-v2 model used in the code comes from the Sentence Transformers library. It was chosen because:
- It is fast and lightweight.
- It generates 384-dimensional vectors, which use less memory than those from larger models.
- It works well across a variety of English-language tasks without special preparation.
This model is just a suggestion. You can use any embedding model you want if you have a particular favourite.
Why Normalize?
You may have noticed a normalization step in the code. We mentioned it earlier, but to be explicit: given two vectors X and Y, cosine similarity is defined as
Similarity = (X · Y) / (||X|| * ||Y||)
Where:
- X · Y is the dot product of vectors X and Y
- ||X|| is the magnitude (length) of vector X
- ||Y|| is the magnitude of vector Y
Since the division is extra work, if all our vectors have unit magnitude the denominator is 1 and the formula reduces to the dot product of X and Y, making the search much faster.
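If you want to convince yourself of this, here is a quick sanity check (using random vectors in place of real embeddings) showing that the plain dot product of normalized vectors matches SciKit-Learn's full cosine_similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)
D = rng.normal(size=(100, 384))   # 100 fake "document" vectors
q = rng.normal(size=(1, 384))     # one fake "query" vector

# Normalize every row to unit length
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
qn = q / np.linalg.norm(q, axis=1, keepdims=True)

dot_scores = (Dn @ qn.T).flatten()              # plain dot product
cos_scores = cosine_similarity(D, q).flatten()  # full cosine formula

print(np.allclose(dot_scores, cos_scores))  # True
```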
Performance Evaluation
The first thing we need is some input data to work with. You can use any input text file for this. For a previous RAG experiment, I used a book downloaded from Project Gutenberg, rather interestingly titled:
“Diseases of cattle, sheep, goats, and pigs by Jno. A. W. Dollar and G. Moussu”
Note that you can review Project Gutenberg's Permissions, Licensing and other Common Requests page on their website.
To summarize, most of Project Gutenberg's eBooks are in the public domain in the US and many other parts of the world. This means that no one can grant or withhold permission to do with them as you wish.
“… as you wish” includes any commercial use, republication in any format, and derivative works or performances.
I downloaded the plain-text version of the book from the Project Gutenberg website to my local PC.
The book contains about 36,000 lines of text, and querying it takes only six lines of code. For my sample question: around line 2315, the book discusses a condition called condylomata. Here is the passage,
INFLAMMATION OF THE INTERDIGITAL SPACE.
(CONDYLOMATA.)
Condylomata result from chronic inflammation of the skin covering the interdigital ligament. Any injury to this region causing even superficial damage may result in chronic inflammation of the skin and hypertrophy of the papillæ, the first stage in the production of condylomata.
Injuries produced by cords slipped into the interdigital space for the purpose of lifting the feet when shoeing working oxen are also fruitful causes.
So we will ask it, “What is Condylomata?” Note that we won't get a final answer, since we aren't feeding the search result to an LLM, but we should see our search return a snippet that would give an LLM all the information it needs to compose an answer if we did.
%%time
# 1. Initialize
store = SimpleVectorStore()

# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")

# 3. Add to Store
if real_docs:
    store.add_documents(real_docs)

# 4. Search
results = store.search("What is Condylomata?", k=1)
results
And here is the output.
Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 2205 chunks from 1 files.
Embedding 2205 documents...
Store now contains 2205 documents.
CPU times: user 3.27 s, sys: 377 ms, total: 3.65 s
Wall time: 3.82 s
[{'score': 0.44883957505226135,
  'text': 'two last\nphalanges, the latter operation being easier than
the former, and\nproviding flaps of more regular shape and better adapted
for the\nproduction of a satisfactory stump.\n\n\n
INFLAMMATION OF THE INTERDIGITAL SPACE.\n\n(CONDYLOMATA.)\n\n
Condylomata result from chronic inflammation of the skin covering
the\ninterdigital ligament. Any injury to this region causing
even\nsuperficial damage may result in chronic inflammation of the
skin and\nhypertrophy of the papillæ, the first stage in the production
of\ncondylomata.\n\nInjuries produced by cords slipped into the
interdigital space for the\npurpose of lifting the feet when shoeing
working oxen are also fruitful\ncauses.\n\nInflammation of the
interdigital space is also a common complication of\naphthous eruptions
around the claws and in the space between them.\nContinual contact with
litter, dung and urine favour infection of\nsuperficial or deep wounds,
and by causing exuberant granulation lead to\nhypertrophy of the papillary
layer of ',
  'metadata': {'source': 'cattle_disease.txt', 'chunk_index': 122400}}]
Less than 4 seconds to read, chunk, embed, store, and query a 36,000-line text document is very respectable.
SciKit-Learn: A More Scalable Approach
NumPy works well for brute-force searches. But what if you have tens or hundreds of thousands of chunks, and brute force becomes slow? Before switching to a vector database, you can try SciKit-Learn's NearestNeighbors. It can use tree-based structures such as KD-Tree and Ball-Tree to speed up the search to roughly O(log N) instead of O(N).
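One caveat worth knowing: SciKit-Learn's KD-Tree and Ball-Tree implementations don't support the cosine metric directly. For unit-length vectors, though, Euclidean distance ranks neighbours in the same order as cosine similarity (since ||x − y||² = 2 − 2·x·y), so you can normalize first and use a tree with the default Euclidean metric. A quick sketch on random stand-in data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 384))
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

q = rng.normal(size=(1, 384))
q = q / np.linalg.norm(q, axis=1, keepdims=True)

# Ball-Tree with the default Euclidean metric over normalized vectors...
tree = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
_, tree_idx = tree.kneighbors(q)

# ...returns the same neighbours as a brute-force cosine search
brute = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute').fit(X)
_, brute_idx = brute.kneighbors(q)

print(np.array_equal(tree_idx, brute_idx))
```

In practice, brute force is often still faster in high dimensions, which is why the class below uses it; the tree option is there when your corpus grows.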
To test this, I downloaded a few more books from Gutenberg, including:
- A Christmas Carol by Charles Dickens
- The Life and Adventures of Santa Claus by L. Frank Baum
- War and Peace by Tolstoy
- A Farewell to Arms by Hemingway
In total, these books contain about 120,000 lines of text. I then copied and pasted all five files ten times over, resulting in fifty files and 1.2 million lines of text. That's roughly 12 million words, assuming an average of 10 words per line. For context, this article contains about 2,800 words, so the volume of data we are testing is more than 4,000 times the size of this document.
$ dir
achristmascarol - Copy (2).txt cattle_disease - Copy (9).txt santa - Copy (6).txt
achristmascarol - Copy (3).txt cattle_disease - Copy.txt santa - Copy (7).txt
achristmascarol - Copy (4).txt cattle_disease.txt santa - Copy (8).txt
achristmascarol - Copy (5).txt farewelltoarms - Copy (2).txt santa - Copy (9).txt
achristmascarol - Copy (6).txt farewelltoarms - Copy (3).txt santa - Copy.txt
achristmascarol - Copy (7).txt farewelltoarms - Copy (4).txt santa.txt
achristmascarol - Copy (8).txt farewelltoarms - Copy (5).txt warandpeace - Copy (2).txt
achristmascarol - Copy (9).txt farewelltoarms - Copy (6).txt warandpeace - Copy (3).txt
achristmascarol - Copy.txt farewelltoarms - Copy (7).txt warandpeace - Copy (4).txt
achristmascarol.txt farewelltoarms - Copy (8).txt warandpeace - Copy (5).txt
cattle_disease - Copy (2).txt farewelltoarms - Copy (9).txt warandpeace - Copy (6).txt
cattle_disease - Copy (3).txt farewelltoarms - Copy.txt warandpeace - Copy (7).txt
cattle_disease - Copy (4).txt farewelltoarms.txt warandpeace - Copy (8).txt
cattle_disease - Copy (5).txt santa - Copy (2).txt warandpeace - Copy (9).txt
cattle_disease - Copy (6).txt santa - Copy (3).txt warandpeace - Copy.txt
cattle_disease - Copy (7).txt santa - Copy (4).txt warandpeace.txt
cattle_disease - Copy (8).txt  santa - Copy (5).txt
Suppose we are looking for the answer to the following question,
Who, after the Christmas holidays, did Nicholas tell his mother of his love for?
In case you didn't know, this is from the book War and Peace.
Let's see how our new search performs against this wealth of information.
Here is the code that uses SciKit-Learn.
First, we have a new class that uses SciKit-Learn's nearest neighbor algorithm.
from sklearn.neighbors import NearestNeighbors


class ScikitVectorStore(SimpleVectorStore):
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        super().__init__(model_name)
        # Brute force is often faster than trees for high-dimensional data
        # unless N is very large, but 'ball_tree' can help in specific cases.
        self.knn = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')
        self.is_fit = False

    def build_index(self):
        print("Building Scikit-Learn Index...")
        self.knn.fit(self.embeddings)
        self.is_fit = True

    def search(self, query: str, k: int = 5):
        if not self.is_fit:
            self.build_index()

        query_vec = self.encoder.encode([query])
        # The 'cosine' metric accounts for vector magnitudes itself,
        # so the query does not need to be normalized here.
        distances, indices = self.knn.kneighbors(query_vec, n_neighbors=k)

        results = []
        for i in range(k):
            idx = indices[0][i]
            # Convert cosine distance back to a similarity score (1 - dist)
            score = 1 - distances[0][i]
            results.append({
                "score": score,
                "text": self.documents[idx]['text']
            })
        return results
And our search code is as simple as the NumPy version.
%%time
# 1. Initialize
store = ScikitVectorStore()

# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")

# 3. Add to Store
if real_docs:
    store.add_documents(real_docs)

# 4. Search
results = store.search("Who, after the Christmas holidays, did Nicholas tell his mother of his love for", k=1)
results
And here is our output.
Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 73060 chunks from 50 files.
Embedding 73060 documents...
Store now contains 73060 documents.
Building Scikit-Learn Index...
CPU times: user 1min 46s, sys: 18.3 s, total: 2min 4s
Wall time: 1min 13s
[{'score': 0.6972659826278687,
  'text': '\nCHAPTER XIII\n\nSoon after the Christmas holidays Nicholas told
his mother of his love\nfor Sónya and of his firm resolve to marry her. The
countess, who\nhad long noticed what was going on between them and was
expecting this\ndeclaration, listened to him in silence and then told her son
that he\nmight marry whom he pleased, but that neither she nor his father
would\ngive their blessing to such a marriage. Nicholas, for the first time,
\nfelt that his mother was displeased with him and that, despite her love\n
for him, she would not give way. Coldly, without looking at her son,\nshe
sent for her husband and, when he came, tried briefly and coldly to\ninform
him of the facts, in her son's presence, but unable to restrain\nherself she
burst into tears of vexation and left the room. The old\ncount began
irresolutely to admonish Nicholas and beg him to abandon his\npurpose.
Nicholas replied that he could not go back on his word, and his\nfather,
sighing and evidently disconcerted, very soon became silent ',
  'metadata': {'source': 'warandpeace - Copy (6).txt',
   'chunk_index': 1396000}}]
Almost all of the 1m 13s it takes to run the above is spent loading and embedding the input data. The actual search, when I ran it separately, took less than a tenth of a second!
Not shabby at all.
Summary
I am not arguing that vector databases are unnecessary. They solve problems that NumPy and SciKit-Learn can't handle. You should graduate from something like our SimpleVectorStore or ScikitVectorStore to Weaviate/Pinecone/pgvector, etc., if any of the following conditions apply.
Persistence: You need data to survive server restarts without rebuilding the index from the source files every time. (Though for simple persistence, np.save or pickling works.)
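For simple persistence, np.save plus a JSON dump really does cover a lot of ground. A minimal sketch, with stand-in data and made-up file names in place of the store's internals:

```python
import json
import numpy as np

# Toy state standing in for SimpleVectorStore's internals
documents = [{"text": "hello world", "metadata": {"source": "demo.txt"}}]
embeddings = np.array([[0.6, 0.8]])

# Save: the matrix as .npy, the documents as JSON
np.save("store_embeddings.npy", embeddings)
with open("store_documents.json", "w", encoding="utf-8") as f:
    json.dump(documents, f)

# Load: rebuild the store's state without re-embedding anything
loaded_embeddings = np.load("store_embeddings.npy")
with open("store_documents.json", "r", encoding="utf-8") as f:
    loaded_documents = json.load(f)

print(np.array_equal(embeddings, loaded_embeddings))  # True
```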
RAM is a ceiling: Your embedding matrix exceeds your server's memory. Note that 1 million 384-dimensional float32 vectors take up only ~1.5 GB of RAM, so you can fit a lot in memory.
Frequent CRUD: You need to constantly update or delete individual vectors while serving reads. NumPy arrays are fixed-size, and appending requires copying the entire array, which is slow.
Metadata filtering: You need complex queries like “find vectors near X where user_id=10 AND date > 2023”. Doing this in NumPy requires a boolean mask, which can get messy.
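For what it's worth, the boolean-mask version of a filtered search looks something like this (the user_id and year fields, and all the data, are made up for illustration):

```python
import numpy as np

# Fake normalized embeddings and parallel metadata arrays
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
user_ids = np.array([10, 10, 99])
years    = np.array([2024, 2022, 2024])

query = np.array([[0.6, 0.8]])

# Boolean mask: user_id == 10 AND year > 2023
mask = (user_ids == 10) & (years > 2023)

# Score every row, then disqualify the rows that fail the filter
scores = (embeddings @ query.T).flatten()
scores[~mask] = -np.inf
best = int(np.argmax(scores))

print(best)  # 0 -> only row 0 satisfies both conditions
```

It works, but keeping several parallel arrays in sync as data changes is exactly the bookkeeping a real database does for you.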
Engineering always involves trade-offs. Using a vector database adds complexity to your setup for robustness that you may not currently need. If you start with a direct RAG setup using NumPy and/or SciKit-Learn for the retrieval process, you get:
- Low Latency. There are no network hops.
- Low cost. No SaaS subscriptions or additional instances.
- Simplicity. It's just a Python script.
Just as you don't need a sports car to go to the grocery store, in most cases NumPy or SciKit-Learn can be all the RAG search you need.



