Amplify the Expert: A Philosophy for Building Enterprise RAG

of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation.

Amplify the expert: the thesis behind every architectural choice in the series.

*Where this article fits in the series: a manifesto alongside the numbered spine – Image by author*

If you have to remember one idea from this series, it is this: enterprise RAG amplifies the expert. It does not replace them. This piece sets down the thesis up front, before the techniques start, because every later article derives from it.

Most architectural mistakes in production RAG follow from forgetting this. Once you accept it, the rest of the series stops being a catalog of techniques and starts looking like a coherent argument.

1. The thesis in one sentence

This series is about building RAG systems that amplify enterprise experts working with their own documents, not about building general-purpose document intelligence that replaces them.

The premise sounds modest but it changes most architectural choices. The system’s job is to scale judgment that already exists in human form: the lawyer who has read a thousand contracts, the underwriter who reaches for the deductible clause on reflex, the compliance officer who knows which sentence the auditor will ask about. Those people are the source of truth. The system handles volume, finds passages in seconds, compares documents systematically. It does not pretend to be the expert.

Every other position the series defends derives from this thesis. Vector stores are a fallback because the expert already knows the keywords. Deterministic dispatchers beat autonomous agents because the expert needs to audit what happened. Expert dictionaries beat fine-tuned embeddings because the expert’s vocabulary is richer than any IDF formula or vector space could capture.

2. The gap between two camps

Most enterprises run two parallel realities on the same documents: an opaque vector-store pipeline the IT camp built, and an expert who still searches with Ctrl+F because nothing the IT camp shipped earned their trust. The series sits on the bridge between the two.

*The two camps and the bridge the series sits in – Image by author*

On the IT side sits the camp that vendors and conference talks have told to chunk every document, push it into a vector store, embed every query, and trust that cosine similarity will find the right passage. They build the system and they run it, but ask them precisely why a given chunk came back and very few can answer. The architecture is opaque even to the people who deployed it.

On the expert side, decades of accumulated reading. Lawyers who have read a thousand contracts. Underwriters who have priced ten thousand policies. Compliance officers who can name the clause an auditor will ask about before the auditor walks in. Ask them how they search a document. The honest answer is almost always the same. They open the PDF, hit Ctrl+F, type a keyword they know works in their corpus, find the passage. If the keyword misses, they go to the table of contents, locate the right section, scan it line by line. That is the retrieval method that decades of expertise has converged on.

The gap is not benign. The IT-camp system is opaque even to the people who built it; the expert-camp method is precise but does not scale. The series’s natural move is to bring them together: take the method the expert already trusts (keyword search anchored on real vocabulary, then TOC navigation when keywords miss) and use the LLM to scale it. LLMs are now strong enough that the retrieval stage no longer has to be clever to compensate. The 2022-era reflex of stacking embedding tricks on top of a weak generation model was solving a problem that no longer exists at the same intensity. Retrieval can stay close to the expert’s natural workflow without losing answer quality.
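The expert’s method described above (keyword first, table of contents when the keyword misses) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the series’s implementation: the `Line` record, the function names, and the sample policy text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    line_no: int
    section: str  # title of the TOC section the line belongs to (assumed field)
    text: str


def keyword_hits(lines: list[Line], keywords: list[str]) -> list[Line]:
    """Ctrl+F equivalent: keep lines containing any keyword, case-insensitive."""
    lowered = [k.lower() for k in keywords]
    return [ln for ln in lines if any(k in ln.text.lower() for k in lowered)]


def toc_fallback(lines: list[Line], section_hints: list[str]) -> list[Line]:
    """Fallback: return every line of the sections whose title matches a hint."""
    lowered = [h.lower() for h in section_hints]
    return [ln for ln in lines if any(h in ln.section.lower() for h in lowered)]


def find_passage(lines: list[Line], keywords: list[str], section_hints: list[str]):
    """Return (method, lines) so the reader can see which path produced the hits."""
    hits = keyword_hits(lines, keywords)
    if hits:
        return "keyword", hits
    return "toc", toc_fallback(lines, section_hints)


# Hypothetical three-line policy extract.
doc = [
    Line(1, "1. Definitions", "The Insured means the policyholder named in the schedule."),
    Line(2, "4. Excess", "The Insured bears the first 500 EUR of each claim."),
    Line(3, "4. Excess", "This amount applies per event."),
]

method, hits = find_passage(doc, ["deductible"], ["excess"])
print(method, [h.line_no for h in hits])
```

Returning the method alongside the hits keeps the path visible: the keyword `deductible` misses in this extract, so the section titled Excess is scanned instead, and the caller can see that this is what happened.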

Underneath the two camps sits a distinction worth stating plainly. There are two ways to answer a question, and they are not the same operation:

From the model’s parametric memory. You write the question, the model answers, one step. That is a chatbot, and for general knowledge it is enough.
From a document. Two phases that have to stay apart. First the passage is found, by keyword the way the expert reaches for Ctrl+F, not by handing the model the raw question. Only then is the question answered, against the document rather than against the model’s training.

Enterprise work is the second case, and the rest of the series keeps the two phases apart.

Mirroring the expert’s method this closely is not cosmetic. The point is not that vector stores are wrong everywhere; the point is that adopting a method the expert cannot recognize, on documents the expert knows by heart, is the fastest way to lose their trust. Without trust, the system does not get used, and a system that is not used has zero value regardless of how impressive its benchmarks look.

3. The historical parallel: machine learning ten years ago

RAG is repeating the enterprise ML wave of 2015 to 2020 almost step for step: the same vendor-copying reflex, the same generic templates, the same failure modes. What worked then, and what will work for RAG now, is domain-specific work anchored on existing expertise.

*The two enterprise waves share the same shape ten years apart – Image by author*

Between 2015 and 2020, enterprises tried to build ML systems by copying Google, DeepMind, and Facebook. “Build a model that learns” was the slogan. Most enterprise ML projects from that era failed to reach production. Gartner put the figure at around 85% in 2019, and the practitioners who lived through the wave cite numbers in the same range. They failed for the same reasons every time. Enterprise companies do not have Google’s data scale. They do not have research teams. They do not have unlimited compute budgets. They do not have the open-ended use cases that justify general approaches.

What ended up working in enterprise ML was domain-specific work. Actuarial forecasting tuned to insurance. Document classification calibrated on internal vocabulary. Risk scoring that exploited the variables domain experts had already identified as predictive. The systems that delivered value were the ones that built on existing expertise, not the ones that tried to learn it from scratch.

RAG is repeating this exact pattern. Enterprises copy OpenAI, plug their data into generic managed RAG products, vectorize everything by default. The failure modes are the same as the ML decade: too much generality, not enough domain anchoring, no answer for the cases the benchmarks did not cover. The alternative is the same answer that worked ten years ago. Domain-specific RAG. Codify the expertise that already exists. Use the structure of documents the team already knows. Amplify the expert instead of bypassing them.

This parallel matters for two reasons. It gives the argument historical depth (we have seen this movie). And it gives the argument constructive framing (we are not against OpenAI; we are saying the trajectory is known and the alternative is to build for our context, not theirs).

4. Where this applies (and where it does not)

The thesis is not universal. Four context properties decide whether this series is your guide. When all four hold, the architecture earns its place; when one is missing, a different stance fits better.

*The four conditions for this architecture and the verdict either way – Image by author*

The four properties are:

The document context is known. The system is deployed on a specific class of documents whose structure, vocabulary, and conventions are known: insurance contracts, medical records, legal agreements, regulatory filings, financial statements, technical specifications. Domain knowledge is an input to the system, not something to be discovered by it.
Domain experts exist and are accessible. The team building the system can talk to the people who use the documents day to day. Those experts know the vocabulary, where each kind of information lives, which keywords retrieve which clauses, which questions matter most. That expertise gets codified into the system rather than guessed at by a generic model.
The goal is amplification, not replacement. The expert continues to exist after the system ships. The system helps them handle volumes they could not process manually and find information in seconds instead of fifteen minutes. The stance is technical (current AI cannot reliably replace expert judgment on non-trivial cases) and operational (experts do not want to be replaced, and systems that pretend otherwise get rejected).
The system must be auditable by the expert. Retrievals must be traceable, answers must cite their sources, decisions must be explainable, behavior must be reproducible. A system the expert cannot audit is one the expert will not use.

These four hold for most enterprise document intelligence work. Insurance brokers, law firms, hospitals, banks, government agencies, any organization where experts work with structured documents under regulatory scrutiny.

Where it does not. Open-domain QA over the web, consumer chat, exploration of a corpus where no expert exists, settings where the questions are unbounded. There, general-purpose retrieval and autonomous agents make more sense. The trade-off shifts: you sacrifice audit and reproducibility, but you also do not have an expert who would have used either. Those are different problems and the architecture should be different. The series’s stance is defensible precisely because it admits where it does not apply.

5. The three founding principles

Amplifying the expert turns into code under three disciplines: choose techniques the expert recognises, build a pyramidal architecture a new engineer can trace in one sitting, and use relational tables (not strings) at every brick junction.

*The three disciplines and the design test each one carries – Image by author*

Pragmatic, expertise-driven. Every choice is judged on a single criterion: does it build on years of accumulated expertise from the people who already know these documents? If yes, it ships. If not, it is noise. The series has no patience for techniques that ignore the expert’s wisdom in favor of a generic model that re-learns it badly from scratch. Fine-tuning an embedding model on domain data is a fallback when expert vocabulary is unavailable, not a default move when the dictionary could be written in an afternoon by sitting with the underwriter.
Pyramidal engineering, not a loose collection of tricks. A production RAG system has to be readable, scalable, and maintainable five years from now. Four clearly named bricks at the top (parsing, question parsing, retrieval, generation), each decomposed into a handful of named functions, each function doing one thing on well-defined inputs and outputs. No orchestration loops (the kind where an LLM chooses the next step until it decides to stop), no hidden state, no “the LLM figures it out”. Concrete design test: a senior engineer joining the team should be able to trace a request from input to output by reading code alone, in one sitting, without an oral handoff. If that is not possible, the architecture has failed. Without this clarity, the system rots: every new feature breaks something old, every contributor gets lost, every audit takes weeks.
Relational data at every brick. Document data arrives unstructured, and unstructured data cannot be filtered, joined, or audited. So the series structures it, at every brick, into relational tables. Parsing turns the PDF into a set of linked DataFrames (line_df, page_df, image_df, toc_df, span_df, object_registry). Question parsing turns the user’s question into a relational set too (question_df plus satellites). Retrieval becomes a query on those structures. Generation structures its output: a typed Pydantic answer, line-level citations, self-assessment fields. The junctions between bricks are tables, not strings. String-soup at any junction produces half the debugging pain in production RAG.

These three are not features. They are the discipline that makes the series’s specific architectural choices defensible across many years and many contributors.
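To make the third discipline concrete, here is a small sketch of what “retrieval becomes a query” can look like once lines and sections live in linked tables. It uses `sqlite3` from the standard library as a stand-in for the DataFrames the series names. The table and column names (`line`, `toc`, `section_id`, `page_no`) and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toc (section_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE line (
        line_id INTEGER PRIMARY KEY,
        page_no INTEGER,
        section_id INTEGER REFERENCES toc(section_id),
        text TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO toc VALUES (?, ?)",
    [(1, "Definitions"), (2, "Exclusions")],
)
conn.executemany(
    "INSERT INTO line VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "Flood means the overflow of inland waters."),
        (2, 3, 2, "Damage caused by flood is excluded."),
        (3, 3, 2, "Damage caused by war is excluded."),
    ],
)

# Retrieval as a query: a keyword filter on lines, scoped to one section by a join.
rows = conn.execute(
    """
    SELECT line.line_id, line.page_no, toc.title, line.text
    FROM line JOIN toc ON toc.section_id = line.section_id
    WHERE lower(line.text) LIKE '%' || ? || '%' AND toc.title = ?
    ORDER BY line.line_id
    """,
    ("flood", "Exclusions"),
).fetchall()
print(rows)
```

Because the junction is a table, the result already carries the line id and page number a citation needs, and the same query can be replayed later to reproduce the retrieval.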

6. The four bricks, through this philosophy

The four bricks (parsing, question parsing, retrieval, generation) are common to most RAG architectures. What is specific here is that each one mirrors something the expert does mentally and amplifies it on the axes a manual workflow cannot reach. Every later article in the series develops one of these four ideas in code.

*Each brick mirrors an expert action and amplifies on one axis – Image by author*

Parsing mirrors how the expert scans a document on first read: grasp the topic, find the section list, spot where the numbers live. The parser does that scan once and keeps the result. Everything missed here cannot be recovered downstream, which makes parsing the most important choice in the pipeline.

Question parsing mirrors the Ctrl+F reflex: the expert starts by typing two or three keywords. The brick keeps that and amplifies it on two axes Ctrl+F cannot reach (co-occurrence and expert-dictionary expansion), then splits the question into a retrieval brief and a generation brief that downstream bricks consume separately.
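A rough sketch of that split, assuming a tiny hand-written expert dictionary. Everything named here (`EXPERT_DICTIONARY`, `RetrievalBrief`, `GenerationBrief`, `parse_question`, the `answer_shape` field) is hypothetical, and co-occurrence is interpreted simply as “pairs of keywords to look for together”, which may differ from what the series implements.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical expert dictionary: canonical term -> aliases seen in the corpus.
EXPERT_DICTIONARY = {
    "deductible": ["excess", "retention"],
    "flood": ["inundation"],
}


@dataclass(frozen=True)
class RetrievalBrief:
    keywords: tuple[str, ...]                  # dictionary terms found in the question
    expanded: tuple[str, ...]                  # keywords plus their expert aliases
    cooccurrence: tuple[tuple[str, str], ...]  # keyword pairs to look for together


@dataclass(frozen=True)
class GenerationBrief:
    question: str      # the original string, passed to generation untouched
    answer_shape: str  # e.g. "amount", "yes_no", "free_text" (assumed labels)


def parse_question(question: str, answer_shape: str = "free_text"):
    """Split one user string into a retrieval brief and a generation brief."""
    words = [w.strip("?.,").lower() for w in question.split()]
    keywords = tuple(w for w in words if w in EXPERT_DICTIONARY)
    expanded = tuple(
        term for k in keywords for term in (k, *EXPERT_DICTIONARY[k])
    )
    pairs = tuple(combinations(keywords, 2))
    return (
        RetrievalBrief(keywords, expanded, pairs),
        GenerationBrief(question, answer_shape),
    )


retrieval, generation = parse_question(
    "What is the deductible for flood damage?", "amount"
)
print(retrieval)
print(generation)
```

The retrieval brief never sees the raw sentence and the generation brief never sees the alias list, so each downstream brick consumes only what it needs.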

Retrieval mirrors the triage the expert does after Ctrl+F returns thirty hits: drop the off-topic ones, keep the few worth a second look. The brick does that at scale and keeps apart three things that “top-k chunks” collapses: the anchor (where the match lands), the scope (what goes to generation), and the context (the surrounding text the expert reads by reflex). The criterion is “the set worth a second human pass”, not “top-k by cosine”.

Article 7 (retrieval): the frame
Article 7A (retrieval as filtering): a filter on line_df and toc_df
Article 7B (anchor detection): parallel detectors, one LLM call at the end
Article 7C (the LLM arbiter): picks the final candidate, with reasons
Article 12 (listing): when the answer is all the matches, not one
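The anchor, scope, and context distinction from the retrieval paragraph above can be shown on a toy document. This sketch makes two assumptions that the series may resolve differently: the scope is taken to be every line of the anchor’s section, and the context is that scope widened by a fixed window of lines. The `Retrieved` record and `retrieve` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    anchor: int         # line number where the match lands
    scope: list[int]    # line numbers handed to generation
    context: list[int]  # wider neighbourhood shown to the expert


def retrieve(lines: dict[int, tuple[str, str]], keyword: str, window: int = 1) -> list[Retrieved]:
    """lines maps line_no -> (section title, text). One Retrieved per matching line."""
    results = []
    ordered = sorted(lines)
    for line_no in ordered:
        section, text = lines[line_no]
        if keyword.lower() not in text.lower():
            continue
        # Scope: every line in the anchor's section (assumption).
        scope = [n for n in ordered if lines[n][0] == section]
        # Context: the scope widened by `window` lines on each side (assumption).
        lo, hi = scope[0] - window, scope[-1] + window
        context = [n for n in ordered if lo <= n <= hi]
        results.append(Retrieved(line_no, scope, context))
    return results


# Hypothetical four-line policy extract.
lines = {
    1: ("3. Cover", "The policy covers fire and theft."),
    2: ("4. Excess", "An excess applies to every claim."),
    3: ("4. Excess", "The amount is 500 EUR per event."),
    4: ("5. Exclusions", "War is excluded."),
}

hit = retrieve(lines, "500 EUR")[0]
print(hit)
```

A top-k chunk would return one blob of text; here the three roles stay separate fields, so the citation can point at the anchor while generation reads the scope.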

Generation is where the discipline against fabrication lives: a faithful restatement of what the retrieved scope says plus the citation to verify it, never a paraphrase that drifts. The LLM fills a typed Pydantic schema (answer, line citations, answer_found, confidence, caveats) that the expert controls by writing the schema and the prompt.

Article 8a (the answer contract): the typed answer with citations and self-checks
Article 8b (prompt assembly): prompt + schema + trace from a parsed question
Article 8c (validation): the validator and the feedback loop that closes the pipeline
Article 13 (the workflow pipeline): wire the four upgraded bricks into one
Article 9 (the upgraded pipeline): the Article 1 (minimal RAG) baseline, upgraded brick by brick
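The answer contract described in the generation paragraph above (answer, line citations, answer_found, confidence, caveats) can be sketched as a typed record plus a validator. The series uses Pydantic; this sketch substitutes a standard-library dataclass so it runs without dependencies. The validation rules and the assumption that confidence lies in [0, 1] are illustrative, not the series’s own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    answer: str
    line_citations: tuple[int, ...]
    answer_found: bool
    confidence: float               # assumed to lie in [0, 1]
    caveats: tuple[str, ...] = ()


def validate(ans: Answer, scope_lines: set[int]) -> list[str]:
    """Return a list of problems; an empty list means the answer passes."""
    problems = []
    if not 0.0 <= ans.confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    if ans.answer_found and not ans.line_citations:
        problems.append("answer given without a citation")
    outside = [n for n in ans.line_citations if n not in scope_lines]
    if outside:
        problems.append(f"citations outside the retrieved scope: {outside}")
    if not ans.answer_found and ans.answer:
        problems.append("answer_found is False but an answer was written")
    return problems


good = Answer("The excess is 500 EUR per event.", (3,), True, 0.9)
bad = Answer("The excess is 750 EUR.", (7,), True, 0.9)

print(validate(good, {2, 3}))
print(validate(bad, {2, 3}))
```

The check that every cited line belongs to the retrieved scope is one mechanical guard against fabrication: an answer that cites a line generation never received is rejected before the expert sees it.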

Every brick respects the same discipline: structured input, structured output, no string-soup at any junction. That makes the system queryable, auditable, replayable, and joinable across years of accumulated questions and answers. Part IV (Article 14 the corpus problem, 15 preparing the corpus, 16 ontology, 17 querying the corpus) shows what the same four bricks become at corpus scale, with a SQL-shaped corpus_index, an ontology in five relational tables, and corpus-level QA. Part V (Article 18 code architecture, 19 storage, 20 evaluation, 21 cost & latency, 22 security) makes the architecture operable across years.

7. What follows from the thesis

The series defends six counter-positions against the mainstream RAG playbook. They are not stylistic choices: each one follows mechanically from the thesis once the four context properties hold.

*Six condition-to-consequence rows derived from the four context properties – Image by author*

If experts know the keywords, the vector store cannot be the foundation. It is the fallback for cases where the dictionary missed an alias.
If embeddings are useful for finding synonyms, they are a discovery tool whose output goes into the expert dictionary, not a production retriever queried on every call. Article 2 catalogues where embedding similarity wins and where it predictably breaks; Article 2bis does the same for cross-encoders.
If retrieval is strong because it filters structured DataFrames built from expert vocabulary, the reranker has no work to do that the upstream filter has not already done.
If the expert must audit every answer, the dispatcher must be deterministic and inspectable, not an autonomous loop. Article 13 (the workflow pipeline) is the brick that holds this discipline.
If the corpus belongs to a specific business with specific documents, vectorising everything indexes noise; structuring at ingestion produces signal that compounds. Article 3 is the argument that RAG is not machine learning, and the ML toolkit solves the wrong problem; Article 4 maps techniques to problems on two axes (document complexity, question control); Article 4bis catalogues the ten production mistakes the field keeps repeating.

The point of this piece is that these positions are not independent. They are one argument with six visible consequences.
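The contrast between a deterministic dispatcher and an autonomous loop is easiest to see in code. In this minimal sketch, a lookup table decides which steps run for each question shape and every run returns a trace. The step functions, the shape labels, and the `ROUTES` table are placeholders invented for the example.

```python
from typing import Callable

Step = Callable[[dict], dict]


def keyword_filter(state: dict) -> dict:
    return {**state, "retrieval": "keyword"}


def toc_scope(state: dict) -> dict:
    return {**state, "scoped_by": "toc"}


def list_all_matches(state: dict) -> dict:
    return {**state, "mode": "listing"}


def generate(state: dict) -> dict:
    return {**state, "generated": True}


# Fixed routes: the question shape decides the steps, not the model.
ROUTES: dict[str, tuple[Step, ...]] = {
    "single_fact": (keyword_filter, generate),
    "section_summary": (toc_scope, generate),
    "listing": (keyword_filter, list_all_matches, generate),
}


def dispatch(question_shape: str, state: dict) -> tuple[dict, list[str]]:
    """Run the fixed route for a shape; return the final state plus a trace."""
    if question_shape not in ROUTES:
        raise ValueError(f"no route for question shape {question_shape!r}")
    trace = []
    for step in ROUTES[question_shape]:
        state = step(state)
        trace.append(step.__name__)
    return state, trace


final_state, trace = dispatch("listing", {"question": "List every exclusion."})
print(trace)
```

The same shape always produces the same trace, and an unknown shape fails loudly instead of letting a model improvise a path. Both properties are what make the run auditable after the fact.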

8. Sources and further reading

This epilogue is the philosophical anchor of the series. The framing of expert judgment as a renewable resource comes from Tetlock and Gardner (Superforecasting, 2015). The tool-as-amplifier philosophy that maps directly to RAG architecture is from Norman (The Design of Everyday Things, 1988). Anthropic’s Building Effective Agents (Dec 2024) is the industry framing of when workflows win over agents. The classic short paper behind the amplify-the-expert tiebreaker is Bainbridge’s Ironies of Automation (1983): the more advanced the automation, the more the human contribution matters. Agentic patterns where the agent still uses the audited bricks the expert curated are follow-up work.

Same direction as the epilogue:

Tetlock & Gardner, Superforecasting: The Art and Science of Prediction, 2015. Expert judgment as a renewable resource; the amplify-the-expert thesis treats domain experts the way Tetlock treats superforecasters.
Norman, The Design of Everyday Things, 1988/2013. Tool-as-amplifier rather than tool-as-replacement; the philosophy applies to RAG architecture the same way it applies to door handles.
Anthropic, Building Effective Agents, December 2024. When LLM agents work and when deterministic workflows win; the decision matrix matches this series’ philosophy.
Carr, The Glass Cage: How Our Computers Are Changing Us, W.W. Norton 2014. Cautionary book on automation that bypasses expert judgment; the broker-corpus stories in the series are concrete instances of Carr’s concerns.
Bainbridge, Ironies of Automation, Automatica 1983. Classic short paper: the more advanced the automation, the more the human contribution matters. The philosophical backing for the amplify-the-expert tiebreaker.

Different angle, different context:

Bostrom, Superintelligence: Paths, Dangers, Strategies, Oxford University Press 2014. The strongest philosophical case for systems that aim past expert amplification toward full autonomy. The context is long-term AGI; this epilogue handles enterprise document work where experts are accessible and audit is required.
Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). Agent reasons and acts without human-curated routing. The context is general-purpose tool-picking; developing this line on top of the audited bricks the expert maintains is follow-up work.

Earlier in the series:

Part I: What works, what breaks

Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.

Part II: The four bricks

Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational tables RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.
- When PyMuPDF can’t see the table: parse PDFs for RAG with Azure Layout. The same tables from Azure Layout: native table cells, OCR, paragraph roles.
- Parse PDFs for RAG locally with Docling: rich tables, no cloud upload. The same tables computed locally with Docling: TableFormer cells, nothing leaves the machine.
- Vision LLMs are PDF parsers too: reading charts and diagrams for RAG. Vision as a parser: the pictures become searchable text.
- Parse scanned PDFs for RAG with EasyOCR: free OCR gives you words, not a document. Where traditional OCR stops: text recovered, structure lost.
- Making a PDF’s images searchable for RAG, without paying to read them all. The image cascade: filter cheap, classify, describe only what is worth reading.
- Reconstructing the table of contents a PDF forgot to ship, so RAG can scope by section. Rebuilding toc_df when the PDF prints a contents page but ships no outline.
RAG questions need parsing too: turn the user’s string into briefs for retrieval and generation. The thesis of question parsing: why a user string needs the same parsing as a document, and how it splits into a retrieval brief and a generation brief.
What the question parser extracts from a user string: keywords, scope, shape, decomposition, clarification. The five families of columns the parser reads straight from the user’s question, with the code that fills each one.
Dispatching the parsed RAG question: chunk strategy, model tier, activations, audit. The decisions the parser makes on top of the user string, using the document’s profile: dispatch, activations, full schema, the audit trail (pipeline_trace.json), and a broker-corpus walkthrough.
Retrieval is filtering, not search: a mental model for enterprise RAG. Retrieval reframed as filtering on line_df and toc_df: anchors small, context large.
Anchor detection for RAG: parallel detectors, then one LLM call at the end (link to come). Parallel anchor detectors: keyword always, embeddings alongside, one LLM call at the end.
Letting an LLM pick the right RAG page: the arbiter pattern at the end of retrieval (link to come). The LLM arbiter: candidates ranked with reasons, one typed JSON out.
- Context engineering: the pipeline you have been building has a name (link to come). Context engineering named: the four bricks emit typed context, Lance Martin’s four strategies map to docintel primitives.