
When (Not) to Use a Vector DB

Vector databases are popular for a reason. They solve a real problem, and in many cases, they are the right choice for RAG systems. But here's the thing: just because you're using embeddings doesn't mean you need a vector database.

We've seen a growing trend where every RAG implementation starts by plugging in a vector DB. That may make sense for large, persistent corpora, but it's not always the most efficient approach, especially when your application is highly dynamic or time-sensitive.

At Planck, we use embeddings to develop LLM-based systems. In one of our real-world applications, however, we chose to avoid a vector database and instead used a simple key-value store, which turned out to be a much better match.

Before I dive into that, let's walk through a simple, general version of our situation to see why.

Foo Example

Let's consider a simple RAG-style application. The user uploads a few text files, perhaps some reports or meeting notes. We split those files into chunks, generate an embedding for each chunk, and use those embeddings to answer queries. The user asks a few questions over the next few minutes, then leaves. At that point, both the files and their embeddings are useless and can be safely discarded.

In other words, the data is extremely short-lived (the user will only ask a few questions), and we want to answer those questions as soon as possible.
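To make the flow concrete, here is a minimal sketch of that pipeline. The `embed` function is a hypothetical stand-in that returns random vectors; a real system would call an embedding model instead. The chunk size and function names are illustrative, not from the original.

```python
import numpy as np

def chunk_text(text: str, chunk_size: int = 200) -> list[str]:
    # Split a document into fixed-size character chunks.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunks: list[str]) -> np.ndarray:
    # Hypothetical stand-in: a real system would call an embedding model here.
    rng = np.random.default_rng(0)
    vecs = rng.random((len(chunks), 8)).astype("float32")
    # Normalize so that the dot product equals cosine similarity.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query_embedding: np.ndarray, chunk_embeddings: np.ndarray,
             chunks: list[str], top_k: int = 2) -> list[str]:
    # Score every chunk against the query and return the best matches.
    sims = chunk_embeddings @ query_embedding
    best = sims.argsort()[-top_k:][::-1]
    return [chunks[i] for i in best]
```

Once the session ends, the chunks and their embeddings are simply garbage-collected; nothing persistent ever needs to be cleaned up.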

Now pause and ask yourself:

Where should I store these embeddings?


Most people's instinct is: “I have embeddings, so I need a vector database.” But stop for a moment and think about what's really going on behind that abstraction. When you send embeddings to a vector DB, it doesn't just “store” them. It builds an index that speeds up similarity search. That indexing work is where a lot of the magic comes from, and where a lot of the cost resides.

For a long-lived, large database, this trade-off makes perfect sense: you pay the index cost once (or incrementally as the data changes), then spread that cost over millions of queries. In our Foo example, that's not what happens. We do the opposite: we repeatedly create small, one-off batches of embeddings, answer a small number of queries per batch, and discard everything.

So the real question is not “should I use a vector database?” but “is the indexing work worth it?” To answer that, we can look at a simple benchmark.

Benchmarking: Brute-Force Retrieval vs. Indexed Retrieval


This section is hands-on. We will look at Python code and walk through the basic algorithms. If the implementation details don't interest you, feel free to skip to the Results section.

We want to compare two systems:

  1. Naive in-memory KNN, with no indexing at all: we just store the embeddings in memory and scan them directly.
  2. An HNSW index, standing in for a vector database, where we pay the indexing cost up front to make each query faster.

First, consider the “no vector DB” approach. When a query comes in, we compute the similarity between the query embedding and all the stored embeddings, and pick the top k. That is exact k-nearest neighbors (KNN) without any index.

import numpy as np

def run_knn(embeddings: np.ndarray, query_embedding: np.ndarray, top_k: int) -> np.ndarray:
    # Dot product equals cosine similarity when all vectors are normalized.
    sims = embeddings @ query_embedding
    # Indices of the top_k highest similarity scores, best first.
    return sims.argsort()[-top_k:][::-1]

The code uses the dot product as a proxy for cosine similarity (assuming normalized vectors) and sorts the scores to find the best matches. That is all it does: it scans every vector and selects the closest ones.
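As a quick sanity check of that assumption, the snippet below (illustrative, not from the original) compares the plain dot product of two normalized vectors with the full cosine similarity formula:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.random(1536).astype("float32")
b = rng.random(1536).astype("float32")

# Normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# Full cosine formula vs. plain dot product: identical for unit vectors.
cosine = float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(a @ b)
```

Because the denominators are both 1 for unit vectors, the two quantities coincide, which is why `run_knn` can skip the division entirely.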

Now, let's look at what a vector DB usually does. Under the hood, most vector databases rely on an approximate nearest neighbor (ANN) index. ANN methods trade a little accuracy for much faster search, and one of the most widely used algorithms is HNSW. We will use the hnswlib library to simulate the indexing behavior.

import numpy as np
import hnswlib

def create_hnsw_index(embeddings: np.ndarray, num_dims: int) -> hnswlib.Index:
    # Build an HNSW graph over all embeddings; this is the up-front indexing cost.
    index = hnswlib.Index(space='cosine', dim=num_dims)
    index.init_index(max_elements=embeddings.shape[0])
    index.add_items(embeddings)
    return index

def query_hnsw(index: hnswlib.Index, query_embedding: np.ndarray, top_k: int) -> np.ndarray:
    # Approximate top-k search against the prebuilt index.
    labels, distances = index.knn_query(query_embedding, k=top_k)
    return labels[0]

To see where the trade-off sits, we can generate random embeddings, normalize them, and measure how long each step takes:

import time
import numpy as np
import hnswlib
from tqdm import tqdm

def run_benchmark(num_embeddings: int, num_dims: int, top_k: int, num_iterations: int) -> None:
    print(f"Benchmarking with {num_embeddings} embeddings of dimension {num_dims}, retrieving top-{top_k} nearest neighbors.")

    knn_times: list[float] = []
    index_times: list[float] = []
    hnsw_query_times: list[float] = []

    for _ in tqdm(range(num_iterations), desc="Running benchmark"):
        embeddings = np.random.rand(num_embeddings, num_dims).astype('float32')
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        query_embedding = np.random.rand(num_dims).astype('float32')
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        start_time = time.time()
        run_knn(embeddings, query_embedding, top_k)
        knn_times.append((time.time() - start_time) * 1e3)

        start_time = time.time()
        vector_db_index = create_hnsw_index(embeddings, num_dims)
        index_times.append((time.time() - start_time) * 1e3)

        start_time = time.time()
        query_hnsw(vector_db_index, query_embedding, top_k)
        hnsw_query_times.append((time.time() - start_time) * 1e3)

    print(f"BENCHMARK RESULTS (averaged over {num_iterations} iterations)")
    print(f"[Naive KNN] Average search time without indexing: {np.mean(knn_times):.2f} ms")
    print(f"[HNSW Index] Average index construction time: {np.mean(index_times):.2f} ms")
    print(f"[HNSW Index] Average query time with indexing: {np.mean(hnsw_query_times):.2f} ms")

run_benchmark(num_embeddings=50000, num_dims=1536, top_k=5, num_iterations=20)

Results

In this example, we use 50,000 embeddings of dimension 1,536 (matching OpenAI's text-embedding-3-small) and retrieve the top five neighbors. The exact results will vary with different settings, but the pattern we care about stays the same.

I encourage you to run the benchmark with your own numbers; it's the best way to see how the trade-offs play out in your specific use case.

On average, the naive KNN search takes 24.54 milliseconds per query. Building the HNSW index for the same embeddings takes about 277 seconds. Once the index is built, each query takes about 0.47 milliseconds.

From these numbers, we can compute the break-even point. The difference between a naive KNN query and an indexed query is 24.07 ms. That means you need about 11,510 queries before the time saved per query compensates for the time spent building the index.
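The arithmetic behind that break-even point can be written out directly (using the rounded measurements above, so the result is approximate):

```python
# Rounded measurements from the benchmark above.
index_build_ms = 277_000.0   # HNSW index construction (~277 seconds)
naive_query_ms = 24.54       # naive in-memory KNN, per query
hnsw_query_ms = 0.47         # HNSW query, per query

# Each indexed query saves this much time over the naive scan...
savings_per_query_ms = naive_query_ms - hnsw_query_ms
# ...so the index only pays for itself after roughly this many queries.
break_even_queries = index_build_ms / savings_per_query_ms
print(f"Break-even after ~{break_even_queries:,.0f} queries")
```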

Generated using benchmark code: Graph comparing naive KNN and indexed search performance

Moreover, even with different embedding counts and top-k values, the break-even point stays in the thousands of queries. You don't get a situation where indexing starts paying off after just a few queries.

Generated using benchmark code: Graph showing break-even points for various embedding counts and top-k settings (image by author)

Now compare that to the Foo example. The user uploads a small set of files and asks a few questions, not thousands. The workload never reaches the point where the index pays off. Instead, the indexing step simply delays the moment the system can answer the first query and adds operational complexity.

In this kind of ad-hoc, per-user context, the simple in-memory KNN approach is not only easier to implement and operate, but also faster end-to-end.

If in-memory storage is not an option, either because the system is distributed or because we need to keep the user's state around for a few minutes, we can use a key-value store like Redis: we store the unique identifier of the user request as the key and all the embeddings as the value.
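Here is a minimal sketch of that idea, assuming the redis-py client (the key prefix, TTL, and helper names are illustrative; `client` would typically be `redis.Redis(host=..., port=...)`, but any object exposing Redis-style `set`/`get` works):

```python
import numpy as np

def store_embeddings(client, request_id: str, embeddings: np.ndarray,
                     ttl_seconds: int = 300) -> None:
    # Serialize the float32 matrix to raw bytes; the TTL (ex=) lets the key
    # expire on its own once the short-lived session is over.
    client.set(f"emb:{request_id}", embeddings.astype("float32").tobytes(),
               ex=ttl_seconds)

def load_embeddings(client, request_id: str, num_dims: int) -> np.ndarray:
    # Deserialize back into an (N, num_dims) matrix for the in-memory KNN scan.
    raw = client.get(f"emb:{request_id}")
    return np.frombuffer(raw, dtype="float32").reshape(-1, num_dims)
```

With the TTL, stale entries clean themselves up; no index maintenance or deletion job is needed.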

This gives us a lightweight, low-complexity solution that is well suited to our use case: short-lived, low-query-count workloads.

Real-World Example: Why We Chose a Key-Value Store


At Planck, we answer insurance-related questions about businesses. A typical flow starts with a business name and address, then fetches real-time data about that particular business, including its online presence, registrations, and other public records. This data becomes our corpus, and we use LLMs and algorithms to answer questions based on it.

The bottom line is that every time we get a request, we produce a new context for it. We don't reuse existing data; it's downloaded on demand and is only relevant for a few minutes.

If you think back to the previous benchmark, this pattern should already trigger your “this is not a vector DB use case” sensor.

Every time we receive a request, we generate new embeddings of temporary data that we may query at most a few hundred times. Indexing those embeddings in a vector DB adds unnecessary latency. With Redis, by contrast, we can store the embeddings immediately and run a quick similarity search in the application code, with almost no indexing delay.

That's why we chose Redis over a vector database. Although vector DBs are very efficient at handling large volumes of embeddings and support fast neighbor queries, they introduce an up-front indexing cost, and in our case, that cost is never recouped.

In Conclusion

If you need to store millions of embeddings and support high query loads across a shared corpus, a vector DB is a great fit. And yes, there are use cases out there that really need and benefit from vector DBs.

But just because you use embeddings or are building a RAG system doesn't mean you have to default to a vector DB.

Each database technology has its own strengths and trade-offs. The best option starts with a deep understanding of your data and use case, rather than mindlessly following a habit.

So, the next time you need to choose a database, stop for a moment and ask: am I choosing the right tool for my workload, or am I just going with the trendy, shiny choice?
