Optimizing Vector Search: Why You Should Flatten Structured Data

When adding structured data to a RAG system, developers often embed the raw JSON directly into a vector database. In practice, this intuitive approach leads to poor retrieval performance. Modern embedding models are typically built on Transformer encoders (the BERT family) and trained on large corpora of unstructured text with the goal of capturing semantic meaning. As a result, even though embedding JSON looks like a simple and obvious solution, feeding raw JSON objects to a standard embedding model produces results that are far from optimal.
Deep dive
Tokenization
The first step is tokenization, which breaks the text down into tokens, usually common sub-word units. Modern embedding models use Byte-Pair Encoding (BPE) or WordPiece tokenization. These algorithms are designed for natural language and split words into small, regular pieces. When the tokenizer meets raw JSON, it struggles with the high frequency of non-alphanumeric characters. For example, "usd": 10 is not treated as a key-value pair; instead, it is split apart:
- The quotes ("), colon (:), and comma (,) each become separate, meaningless tokens
- usd and 10 become isolated tokens with no connecting context
This creates a low signal-to-noise ratio. In natural language, almost every word contributes to the semantic "signal". In JSON (and other structured formats), a significant percentage of tokens is spent on structural syntax that carries zero semantic value.
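To make the signal-to-noise problem concrete, here is a minimal sketch. The regex below is a rough stand-in for a real BPE tokenizer (not the actual algorithm), but it is enough to count how many tokens are pure syntax:

```python
import re

def rough_tokens(text):
    """Very rough stand-in for a subword tokenizer: words, numbers,
    and individual punctuation/structural characters."""
    return re.findall(r"[A-Za-z]+|\d+|\S", text)

def structural_ratio(text):
    """Fraction of tokens that are pure structure (quotes, braces, colons...)."""
    tokens = rough_tokens(text)
    structural = [t for t in tokens if not t[0].isalnum()]
    return len(structural) / len(tokens)

json_snippet = '{"price": {"usd": 10, "eur": 9}}'
text_snippet = "The price is 10 US dollars or 9 euros"

print(structural_ratio(json_snippet))  # a large share of tokens are syntax
print(structural_ratio(text_snippet))  # 0.0 -- every token carries meaning
```

Real BPE vocabularies merge frequent character pairs, so exact token counts differ, but the imbalance between content and syntax tokens holds.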
The attention mechanism
The core strength of Transformers lies in the attention mechanism, which allows the model to weigh the importance of tokens relative to each other.
In the sentence "The price is 10 US dollars or 9 euros", attention can easily link the value 10 to the word price, because this relationship is well represented in the model's pre-training data: the model has seen this language pattern millions of times. In raw JSON, on the other hand:
"price": {
"usd": 10,
"eur": 9,
}
the model encounters structural syntax it was never primarily trained to interpret. Without a linguistic connector, the resulting vector fails to capture the true intent of the data, because the relationship between key and value is obscured by the format itself.
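To illustrate the mechanism itself, here is a toy scaled dot-product attention sketch in pure Python. The 2-d vectors and token assignments are made up for illustration; real models use learned high-dimensional embeddings:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query vector
    against a list of key vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Hypothetical 2-d "embeddings": a content token vs. structural tokens.
price_vec = [1.0, 0.2]            # query: "price"
keys = {
    "10":  [0.9, 0.3],            # numeric value, semantically related
    '"':   [0.0, 0.0],            # structural quote, near-zero signal
    ":":   [0.05, -0.05],         # structural colon
}
weights = attention_weights(price_vec, list(keys.values()))
for token, w in zip(keys, weights):
    print(f"{token!r}: {w:.2f}")
```

In this toy setup the query "price" attends most strongly to "10"; when structural tokens dominate a document, they soak up attention mass without contributing meaning.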
Mean pooling
The final step in producing a single embedding of a document is mean pooling. Statistically, the final embedding E is the centroid of all token vectors e1, e2, ..., en in the document:

E = (e1 + e2 + ... + en) / n
This is where JSON tokens become a statistical liability. If 25% of the tokens in a document are structural symbols (braces, quotes, colons), the final vector is heavily influenced by the "meaning" of punctuation. The vector is effectively pulled away from its true semantic center by these noise tokens. When a user submits a natural-language query, the distance between the "clean" query vector and the "noisy" JSON vector grows, directly harming retrieval metrics.
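The centroid-pulling effect can be sketched numerically. The 2-d token vectors below are hypothetical; the point is only that averaging in structural tokens drags the document vector away from the query:

```python
import math

def mean_pool(vectors):
    """Average token vectors into a single document embedding."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d token vectors: content tokens cluster together,
# structural tokens sit in a different region of the space.
content = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]]        # "price", "10", "usd"
structure = [[-0.2, 0.8], [-0.1, 0.9], [-0.3, 0.7]]   # '"', ':', '{'

query = [1.0, 0.1]  # a clean natural-language query vector

clean_doc = mean_pool(content)
noisy_doc = mean_pool(content + structure)

print(cosine(query, clean_doc))  # high similarity
print(cosine(query, noisy_doc))  # lower: syntax tokens pulled the centroid
```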
Make it flat
So now that we know the limitations of embedding raw JSON, we need a way around them. The most common and straightforward approach is to flatten the JSON and convert it into natural language.
Let's consider a typical product:
{
  "skuId": "123",
  "description": "This is a test product used for demonstration purposes",
  "quantity": 5,
  "price": {
    "usd": 10,
    "eur": 9
  },
  "availableDiscounts": ["1", "2", "3"],
  "giftCardAvailable": "true",
  "category": "demo product"
  ...
}
This is a simple object with a description, price, and a few other fields. Let's run the tokenizer over it and see what it looks like:

Now, let's convert it to text to make the embedding task easier. To do that, we can define a template and substitute the JSON values into it. For example, this template can be used to describe a product:
Product with SKU {skuId} belongs to the category "{category}"
Description: {description}
It has a quantity of {quantity} available
The price is {price.usd} US dollars or {price.eur} euros
Available discount ids include {availableDiscounts as comma-separated list}
Gift cards are {giftCardAvailable ? "available" : "not available"} for this product
So the end result will look like this:
Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product
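A minimal Python sketch of this template rendering, assuming the field names of the example object (a real schema may differ, and the flatten_product_template name is hypothetical):

```python
def flatten_product_template(p):
    """Render a product dict into the natural-language template above.
    Field names follow the example object; adapt to your schema."""
    discounts = ", ".join(p["availableDiscounts"])
    gift = "available" if p["giftCardAvailable"] == "true" else "not available"
    return (
        f'Product with SKU {p["skuId"]} belongs to the category "{p["category"]}"\n'
        f'Description: {p["description"]}\n'
        f'It has a quantity of {p["quantity"]} available\n'
        f'The price is {p["price"]["usd"]} US dollars or {p["price"]["eur"]} euros\n'
        f'Available discount ids include {discounts}\n'
        f'Gift cards are {gift} for this product'
    )

product = {
    "skuId": "123",
    "description": "This is a test product used for demonstration purposes",
    "quantity": 5,
    "price": {"usd": 10, "eur": 9},
    "availableDiscounts": ["1", "2", "3"],
    "giftCardAvailable": "true",
    "category": "demo product",
}
print(flatten_product_template(product))
```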
Then apply the tokenizer to it:

Not only does it now have 14% fewer tokens, but it is also in a clean natural-language form that preserves the necessary semantic meaning and context.
Let's measure the results
Note: The complete, reproducible code for this experiment is available in the Google Colab notebook [1].
Now let's measure the retrieval performance of both options. To keep things simple, we'll focus on standard retrieval metrics: Recall@k, Precision@k, and MRR. We'll use a standard embedding model (all-MiniLM-L6-v2) and the Amazon ESCI dataset, with 5,000 random queries and 3,809 related products.
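For reference, these metrics are straightforward to compute over ranked result lists; a minimal sketch:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top = retrieved[:k]
    return sum(1 for r in top if r in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids found in the top-k."""
    top = retrieved[:k]
    return sum(1 for r in top if r in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit (0 if none)."""
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["p3", "p1", "p7", "p2"]  # hypothetical ranked search results
relevant = {"p1", "p2"}
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(recall_at_k(retrieved, relevant, 4))     # 1.0
print(mrr(retrieved, relevant))                # 0.5 (first hit at rank 2)
```

In the full experiment, MRR is averaged over all 5,000 queries.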
all-MiniLM-L6-v2 is a popular choice: it is small (22.7M parameters) but fast and accurate, making it a good fit for this test.
For the dataset, the milistu/amazon-esci-data version of Amazon ESCI is used [3]; it is available on Hugging Face and contains a collection of Amazon products and search-query data.
The flattening function used for text conversion is:
def flatten_product(product):
    return (
        f"Product {product['product_title']} from brand {product['product_brand']}"
        f" and product id {product['product_id']}"
        f" and description {product['product_description']}"
    )
A sample raw JSON data is:
{
"product_id": "B07NKPWJMG",
"title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
"description": " Specifications
Model Number: Rowood Treasure box LK502
Average build time: 5 hours
Total Pieces: 123
Model weight: 0.69 kg
Box weight: 0.74 KG
Assembled size: 100*124*85 mm
Box size: 320*235*39 mm
Certificates: EN71,-1,-2,-3,ASTMF963
Recommended Age Range: 14+
Contents
Plywood sheets
Metal Spring
Illustrated instructions
Accessories
MADE FOR ASSEMBLY
-Follow the instructions provided in the booklet and assembly 3d puzzle with some exciting and engaging fun. Fell the pride of self creation getting this exquisite wooden work like a pro.
GLORIFY YOUR LIVING SPACE
-Revive the enigmatic charm and cheer your parties and get-togethers with an experience that is unique and interesting .
",
"brand": "RoWood",
"color": "Treasure Box"
}
For vector search, two FAISS indexes are created: one for plain text and one for JSON-formatted text. Both indexes are flat, meaning every query is compared against every stored entry instead of using an Approximate Nearest Neighbor (ANN) index. This is important to ensure that the retrieval metrics are not affected by ANN approximation error.
D = 384  # embedding dimension of all-MiniLM-L6-v2
index_json = faiss.IndexFlatIP(D)
index_flatten = faiss.IndexFlatIP(D)
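Under the hood, a flat inner-product index is just an exhaustive scan. A pure-Python sketch of what IndexFlatIP computes (toy 3-d vectors standing in for the 384-d MiniLM embeddings):

```python
def flat_ip_search(index_vectors, query, k):
    """Exhaustive inner-product search: what a flat (non-ANN) index does.
    Scores the query against every stored vector, with no approximation."""
    scores = [
        (sum(q * x for q, x in zip(query, vec)), i)
        for i, vec in enumerate(index_vectors)
    ]
    scores.sort(reverse=True)
    return [(i, s) for s, i in scores[:k]]  # (vector id, score) pairs

# Toy 3-d "embeddings" standing in for the 384-d MiniLM vectors.
stored = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.4, 0.4, 0.2]]
query = [0.9, 0.1, 0.0]
print(flat_ip_search(stored, query, k=2))
```

FAISS does the same computation with vectorized C++ and SIMD, which is why a flat index stays practical for a few thousand products.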
To reduce the dataset, 5,000 random queries were selected, and all corresponding products were embedded and added to the indexes. The collected metrics are as follows:

Retrieval metrics for the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset. The flattened method consistently produces higher scores across all key retrieval metrics (Precision@10, Recall@10, and MRR). Image by author.

And the relative improvement of the flattened version is:

The analysis confirms that embedding raw structured data into a generic vector space is a suboptimal approach, and that adding a simple pre-processing step to flatten the structured data brings a significant improvement in retrieval metrics (boosting Recall@k and Precision@k by about 20%). A key takeaway for developers building RAG systems: effective data preparation is critical to achieving high performance in a semantic/RAG retrieval system.
References
[1] Complete test code https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
[2] Model https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[3] Amazon ESCI data. Specific version used: https://huggingface.co/datasets/milistu/amazon-esci-data
The original dataset is available at https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search
[4] FAISS https://github.com/facebookresearch/faiss