Untaught Lessons of RAG Retrieval: Cosine Is Not a Basis


A companion to the Enterprise Document Intelligence series, whose philosophy is set out in Grow the Expert. It zooms in on brick 3 (retrieval) of the four-brick architecture and presents several cross-cutting themes.

The standard story treats retrieval as: embed the query, retrieve the top-k by cosine, optionally rerank. We disagree with almost every part of it. Retrieval is a filtering problem on structured tables, not free-text search. Embedding is an optional fallback, not a foundation. Anchor and scope are two objects, not one. Each of these is a defensible position, with measurable consequences.

*Where this article sits in the series: brick 3 (retrieval) is highlighted – Image by author*

📓 Runnable notebooks are on GitHub: doc-intel/notebooks-vol1.

*Community code repo at doc-intel/notebooks-vol1 – Image by author*

The naive premise this article pushes back on

*Architectural difference: one cosine signal over chunks vs three signals in parallel on structured tables – Image by author*

The naive pipeline chunks the document, embeds every chunk, embeds the query, and ranks by cosine. That single signal is opaque, and it throws away the document's structure. We store the document as line_df + toc_df and run three retrieval signals in parallel (keywords on lines, reasoning over the TOC, cosine embedding), then let an LLM arbiter judge once over all three sets of visible hits.

*Keywords always run, TOC reasoning always runs, embedding fires only when the vocabulary diverges – Image by author*

Below are six untaught lessons of this brick.

Lesson 1 – Retrieval is filtering, not searching

Once parsing is complete, retrieval is closer to a SQL-like filtering problem on line_df and toc_df, the inverse of the chunk-embed-cosine-top-k framing. The shift is easy to state: the query has columns, the document has columns, and retrieval is a join.

Why it matters. Search and filter are not synonyms; the two operations have different mechanics. Search ranks all candidates by a continuous similarity score (cosine, BM25), applies a top-k cutoff, and always returns something, even when the answer is not in the document. Filter applies a boolean condition (line.contains("X"), toc.title in [...]), keeps every matching row and drops the rest, and can return zero rows when the document has no answer. Auditability is a big part of the gap: a filter condition is a single line of testable code that behaves the same way in six months; search quality depends on which embedding dimensions happened to matter, and you cannot replay that judgment without re-running the model.
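The contrast between the two mechanics can be sketched in a few lines. This is a minimal illustration, not the series' code: LINES is a hypothetical stand-in for line_df, and search_lines uses toy word overlap in place of cosine or BM25, only to show how a top-k cutoff behaves.

```python
# Hypothetical stand-in for line_df: one row per parsed line.
LINES = [
    {"line_id": 0, "text": "This policy covers flood damage to the insured building."},
    {"line_id": 1, "text": "The deductible is $1000 per claim."},
    {"line_id": 2, "text": "Coverage begins on the effective date stated below."},
]


def filter_lines(lines, keyword):
    """Boolean filter: keep every row containing the keyword, drop the rest."""
    needle = keyword.lower()
    return [row for row in lines if needle in row["text"].lower()]


def search_lines(lines, query, k=2):
    """Toy ranked search: score by word overlap, then cut at top-k.

    Stands in for cosine/BM25 only to show the top-k mechanics.
    """
    query_words = set(query.lower().split())

    def score(row):
        return len(query_words & set(row["text"].lower().split()))

    return sorted(lines, key=score, reverse=True)[:k]


# The filter can come back empty; the ranked search returns k rows regardless.
print(len(filter_lines(LINES, "earthquake")))
print(len(search_lines(LINES, "earthquake damage")))
```

The filter condition is the whole audit trail: rerunning it on the same table gives the same rows, with no model in the loop.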

A concrete example. The user asks “What positional encoding does the paper use?”. Naive RAG embeds the question, scores 300+ chunks, and returns the top 5. The series' approach filters line_df where the line contains "positional encoding" (4 hits), filters toc_df where the section title contains "positional" (1 section, 3.5 Positional Encoding), and the arbiter sees both, anchor: line; scope: section. No cosine is needed.

→ Article 7A: Retrieval is filtering, not searching sets out the mental model.

Lesson 2 – Anchor and scope, separated

You pin the anchor on the one line that says “premium” (precise) but pass the whole surrounding section to generation (enough context); merging the two breaks precision and context in one move. Top-k forces you to choose: small chunks lose context, big chunks lose precision. We get both by separating them.

A concrete example. For a definition question, the anchor is a single line ("the deductible is the amount the insured pays before coverage begins"), and the scope is the section that surrounds it (the three full sentences the LLM needs to phrase the answer). Naive top-k returns either a line (no context) or a paragraph (anchor unclear). The series' retrieval returns anchor + scope as a typed pair.
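One way to picture the typed pair is a small record holding the matched line and the lines of its enclosing section. The sketch below is an assumption about shape, not the series' implementation: the Retrieval class, the section_id column, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical line table: each line carries the id of the section it belongs to.
LINES = [
    {"line_id": 10, "section_id": "2.1", "text": "2.1 Definitions"},
    {"line_id": 11, "section_id": "2.1",
     "text": "The deductible is the amount the insured pays before coverage begins."},
    {"line_id": 12, "section_id": "2.1", "text": "It applies separately to each claim."},
    {"line_id": 13, "section_id": "2.2", "text": "2.2 Premium"},
    {"line_id": 14, "section_id": "2.2", "text": "The premium is due annually."},
]


@dataclass(frozen=True)
class Retrieval:
    anchor_line_id: int    # the single line that matched (precision)
    scope_line_ids: tuple  # every line of the enclosing section (context)


def retrieve(lines, keyword):
    """Return one (anchor, scope) pair per line that contains the keyword."""
    needle = keyword.lower()
    results = []
    for row in lines:
        if needle in row["text"].lower():
            scope = tuple(
                other["line_id"]
                for other in lines
                if other["section_id"] == row["section_id"]
            )
            results.append(Retrieval(row["line_id"], scope))
    return results


for hit in retrieve(LINES, "deductible"):
    print(hit.anchor_line_id, hit.scope_line_ids)
```

Keeping the two fields separate means the anchor can be cited exactly while the scope is what gets passed to generation.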

→ Article 7A: Retrieval is filtering, not searching draws the line between anchor and scope.

Lesson 3 – Embedding comes last, not first

Keywords always run (cheap, deterministic); the document's TOC is a first-class retrieval signal; embedding is the last, optional signal, used only when a vocabulary mismatch is expected. The 2024-era reflex starts with embedding; we reserve it for cases where the cheap signals have failed.

A concrete example. A real query against an insurance policy: “effective date?”. Naive RAG embeds and returns 5 chunks. The series runs the keywords "effective" and "date" → 1 line found → done. Embedding never fired. Cost: one regex pass over line_df, a few milliseconds. A 2-cent cosine search avoided.
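The ordering can be sketched as a cheap regex pass followed by a fallback that is only invoked when the pass comes back empty. This is a simplified sketch under stated assumptions: the TOC signal is left out, "zero keyword hits" is used as the fallback trigger, and fake_embed_fallback is a placeholder rather than a real embedding search.

```python
import re

# Hypothetical parsed lines of an insurance policy.
LINES = [
    "This policy is issued to the named insured.",
    "The effective date of this policy is 1 January 2025.",
    "The deductible is $1000 per claim.",
]


def keyword_hits(lines, keywords):
    """Lines that contain every keyword as a whole word, case-insensitively."""
    patterns = [re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE) for k in keywords]
    return [line for line in lines if all(p.search(line) for p in patterns)]


def retrieve(lines, keywords, embed_fallback):
    """Run the cheap signal first; call the fallback only on zero hits."""
    hits = keyword_hits(lines, keywords)
    if hits:
        return {"signal": "keyword", "hits": hits}
    return {"signal": "embedding", "hits": embed_fallback(lines)}


calls = []


def fake_embed_fallback(lines):
    # Placeholder for an embedding search; it only records that it was invoked.
    calls.append("embed")
    return []


result = retrieve(LINES, ["effective", "date"], fake_embed_fallback)
print(result["signal"], len(result["hits"]), calls)
```

Recording which signal produced the hits keeps the cost visible: a keyword resolution leaves the fallback untouched.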

→ Article 7B: Finding the right anchors builds the three-signal pipeline.

Lesson 4 – Keywords prove absence; embeddings cannot

A zero from keyword search means the answer really is not there; a low embedding similarity may mean absence or merely different wording, so embedding is a refinement, not a decision gate. This asymmetry is why keywords are the primary signal in enterprise RAG.

A concrete example. The user asks “does this contract cover earthquake damage?” of a flood-only policy. A keyword search for "earthquake" returns exactly zero matches in line_df. The pipeline can set answer_found = False with confidence. Cosine embedding returns 5 chunks (the lines most topically related to natural disasters), and the LLM, seeing them, may wrongly answer yes. Keywords save the day.
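The absence gate can be written as a function that returns its own evidence. The sketch below is illustrative only: keyword_gate and the fields of its trace are hypothetical names, and the two sample lines stand in for line_df.

```python
# Hypothetical lines from a flood-only policy.
LINES = [
    "This policy covers loss caused by flood.",
    "Flood means the overflow of inland or tidal waters.",
]


def keyword_gate(lines, keyword):
    """Count literal matches and derive answer_found from the count alone."""
    needle = keyword.lower()
    line_ids = [i for i, line in enumerate(lines) if needle in line.lower()]
    return {
        "keyword": keyword,
        "match_count": len(line_ids),
        "line_ids": line_ids,
        "answer_found": bool(line_ids),
    }


print(keyword_gate(LINES, "earthquake"))
print(keyword_gate(LINES, "flood"))
```

A zero match_count is a fact about the table that anyone can recheck, which is what lets the pipeline stop before an LLM ever sees loosely related lines.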

→ Article 7B: Finding the right anchors explains the keyword-first order.

Lesson 5 – Co-occurrence beats BM25 on small corpora

BM25 ranks by term frequency, but the shape of an enterprise answer is a single mention of the subject near a specific value, so co-occurrence and high-precision regex anchors beat IDF statistics on a small corpus. IDF estimates break down on a corpus of 20 documents, where every word is “rare” by Wikipedia standards.

A concrete example. The question is “how much is the deductible?”. BM25 ranks by the frequency of "deductible"; a passage from the definitions section, where the word appears 12 times, ranks first. Co-occurrence search looks for lines containing both "deductible" and a number; the actual policy line ("the deductible is $1000") ranks first because it pairs the term with $1000, and the LLM can extract the value cleanly.
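The difference can be reproduced on three lines. This is a sketch, not a benchmark: term_frequency_rank uses raw term counts as a stand-in for the frequency component of BM25 (no IDF, no length normalisation), and the AMOUNT pattern and sample lines are assumptions made for illustration.

```python
import re

# Hypothetical lines: a definitions passage, the actual policy line, and noise.
LINES = [
    "Deductible: the deductible is the part of a loss the insured bears; "
    "a deductible may be fixed.",
    "The deductible is $1000.",
    "Claims must be reported within 30 days.",
]

# A dollar amount such as $1000 or $1,000.50.
AMOUNT = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?")


def term_frequency_rank(lines, term):
    """Rank by raw term count (frequency only; not a full BM25)."""
    needle = term.lower()
    return sorted(lines, key=lambda line: line.lower().count(needle), reverse=True)


def cooccurrence_hits(lines, term):
    """Keep lines where the term and a dollar amount appear together."""
    needle = term.lower()
    return [line for line in lines if needle in line.lower() and AMOUNT.search(line)]


top_by_frequency = term_frequency_rank(LINES, "deductible")[0]
hits = cooccurrence_hits(LINES, "deductible")
print(top_by_frequency)
print(hits, AMOUNT.search(hits[0]).group())
```

The frequency ranking surfaces the passage that talks about the term the most; the co-occurrence filter surfaces the line that states the value.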

→ Article 7B: Finding the right anchors benchmarks co-occurrence against BM25.

Lesson 6 – One LLM pass over the TOC

Give the 20-100 rows of toc_df to a small model and ask which sections answer the question: that costs one cached call and catches paraphrases (“exit early” ≈ “Termination”) that keywords miss.

TOC reasoning is one of the most underused retrieval signals in production RAG.

A concrete example. The user asks “When can I leave the insurance early?”. Substring matching on "leave" returns zero TOC entries. One LLM call over the full TOC (28 lines, fits in one small prompt) returns the section “Termination and Cancellation”, the correct paraphrase. One cached LLM call, deterministic thereafter, and the right anchor.
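The cached TOC call can be sketched with the model injected as a plain callable. Everything here is an assumption for illustration: CachedTocSelector, build_prompt, the JSON reply format, and fake_llm (which stands in for a real model and always answers with section 3) are hypothetical, and no real LLM API is used.

```python
import hashlib
import json

# Hypothetical stand-in for toc_df.
TOC = [
    {"section_id": "1", "title": "Definitions"},
    {"section_id": "2", "title": "Coverage"},
    {"section_id": "3", "title": "Termination and Cancellation"},
]


def build_prompt(question, toc):
    """Put the whole TOC and the question into one small prompt."""
    listing = "\n".join(f'{row["section_id"]} {row["title"]}' for row in toc)
    return (
        "Table of contents:\n" + listing
        + f"\n\nQuestion: {question}\n"
        + "Reply with a JSON list of the section ids that could answer it."
    )


class CachedTocSelector:
    def __init__(self, llm):
        self.llm = llm    # any callable: prompt string -> reply string
        self.cache = {}   # prompt hash -> parsed list of section ids

    def select(self, question, toc):
        prompt = build_prompt(question, toc)
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = json.loads(self.llm(prompt))
        wanted = set(self.cache[key])
        return [row for row in toc if row["section_id"] in wanted]


llm_calls = []


def fake_llm(prompt):
    # Stand-in for a real model call; always picks section 3.
    llm_calls.append(prompt)
    return '["3"]'


selector = CachedTocSelector(fake_llm)
first = selector.select("When can I leave the insurance early?", TOC)
second = selector.select("When can I leave the insurance early?", TOC)
print(first[0]["title"], len(llm_calls))
```

Keying the cache on the full prompt means the same question over the same TOC is answered from memory, so the result is stable after the first call.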

→ Article 7B covers reasoning over the TOC, and Article 7C: LLM as an arbiter adds the arbiter.

The six lessons share one move: reject the chunk-embed-cosine reflex, and treat retrieval as filtering on structured tables instead. Keywords always run because they prove absence; the TOC is a first-class signal because the document has already declared its structure; embedding is an optional refinement, not a foundation. The deep dives (7A, 7B, 7C, 7bis) ship running code on real documents; this piece is the catalog that points to them.

Across domains

The same three-signal retrieval pattern (keywords on line_df + reasoning over toc_df + embedding fallback) holds across domains. The vocabulary and the depth of the TOC vary; the signal cascade does not. Five domains below, one retrieval pattern, one auditable trace per call.

*Embedding fires only on the medical row, where the user's vocabulary diverges from the document's – Image by author*

Embedding fires only on the medical row, where the user's vocabulary (“tachycardia”) differs from the document's (“rapid heartbeat”). The other four rows resolve entirely with keywords + TOC. Keywords prove absence (Lesson 4), the TOC catches paraphrases (Lesson 6), and the anchor/scope split keeps precision and context separate (Lesson 2) across all rows. The cost gradient is real: the four rows resolved by keywords run in milliseconds with zero LLM tokens; the medical row pays for one embedding pass and one arbiter call.

Resources and further reading

The standard textbooks on retrieval are written for web-scale search and short consumer queries. The series' framing assumes a small enterprise corpus where the structure is known and the vocabulary is an asset.

Retrieval is filtering, not searching (Article 7A). Published article; the mental model: retrieval as filtering on structured tables.
Embedding Is Not Magic (Part 2). A catalog of published failure modes of embedding similarity.
Rerankers Aren't Magic Anymore (Article 2bis). When a cross-encoder pays off and when it doesn't.
