Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings

A picture is worth a thousand words. Yet very few enterprise chatbots can reliably return images grounded in their source documents.
Why is that?
The reason is that, although returning images would be a significant upgrade over the text-only experience, it is difficult to do reliably and consistently. Yet there is no shortage of use cases where it would be invaluable: prospective buyers of real-estate projects, or service technicians querying the latest machine parameters, would absolutely prefer to see the targeted, relevant property images and maintenance tables as part of the response. Instead, the best most systems can do is return an answer with links to the source documents (brochures, videos, manuals) and webpages.
In this article, I will present an open-source MultiModal Proxy-Pointer RAG pipeline that can achieve this, primarily because it treats a document as a hierarchical tree of semantic blocks rather than a bag of words to be shredded blindly into chunks.
This is a follow-up to my previous articles on Proxy-Pointer RAG, where I explored the architecture rationale and implementation at length. Here, we will cover the following:
- Why is multimodal response a difficult problem to solve? And what are the current techniques that can be applied?
- How Proxy-Pointer achieves it with full scalability and minimal cost with a text-only pipeline — no multimodal embeddings required
- A working prototype with test queries for you to try using the open-source repo.
Let’s begin.
Multimodality and Normal RAG
When we think of multimodal RAG, it almost always means you can search the knowledge base using images along with a text query; it is rarely the other way around. To understand why, let's look at how this is generally approached:
Image captioning
Run an OCR/vision model on each image, turn it into a paragraph of text, and index that caption in a chunk along with the surrounding text. This is not ideal, since sliding-window chunking may split the caption across chunks.
The core issue is a misalignment between retrieval units and semantic units. Traditional RAG retrieves arbitrary chunks, while meaning—and especially images—belongs to coherent sections of a document.
When a chunk is retrieved, the LLM may only see a partial caption (e.g., for Figure 5), making it difficult to determine whether the image is actually relevant to this chunk or to an adjacent one that was not retrieved. In addition, the synthesizer often receives multiple chunks from different documents with no shared context, potentially containing several unrelated image captions. This makes it difficult for the LLM to reliably decide which, if any, of the images are relevant to the user’s query.
Multimodal Embedding
Another approach is to embed both images and text into a shared vector space using a multimodal model. While this enables cross-modal retrieval, it introduces a different challenge. Multimodal embeddings optimize for similarity, not grounding. Visually or structurally similar artifacts—such as financial tables across different companies—can appear nearly identical in vector space, even when only one is relevant to the query.
Without the context of document structure, the system retrieves candidates based on similarity but cannot confidently determine which image actually belongs in the response. As a result, the LLM is forced to choose between multiple plausible but potentially incorrect visuals—often making it safer to return none at all than risk showing the wrong one.
Proxy-Pointer solves this by replacing text-based chunking with tree-based chunking. We don't chunk by character count; we chunk by sectional boundaries. If a section contains 3 paragraphs and 2 images, none of its chunks bleed into the following section. The LLM can treat each section as a fully independent semantic unit and judge the images inside it with confidence.
Let’s see how this will work in practice.
Prototype Setup
I built a multimodal chatbot on 5 AI research papers (all CC-BY licensed): CLIP, Nemobot, GaLore, VectorFusion and VectorPainter. For PDF extraction, the Adobe PDF Extract API was used. As expected, the papers contain dense text along with a total of 270 images (figures, tables, formulae) between them that Adobe could extract. The embedding model is gemini-embedding-001, with dimensions reduced to 1536 from the default 3072, which speeds up search and reduces memory usage. This is a text-only embedding model; no multimodal embedding model is used. For all LLM calls (noise filter, re-ranker, synthesizer and final vision filter), gemini-3.1-flash-lite-preview is used. The vector index is FAISS.
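For readers who want to reproduce the indexing locally, here is a minimal sketch of how the embeddings and FAISS index might be built with the google-genai SDK; the sample chunk text and batch handling are illustrative, not the repo's exact code:

```python
import faiss
import numpy as np
from google import genai
from google.genai import types

client = genai.Client()  # expects GEMINI_API_KEY in the environment

def embed_texts(texts: list[str], dim: int = 1536) -> np.ndarray:
    """Embed breadcrumb-prefixed chunks with a text-only embedding model."""
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=texts,
        config=types.EmbedContentConfig(output_dimensionality=dim),
    )
    vectors = np.array([e.values for e in result.embeddings], dtype="float32")
    faiss.normalize_L2(vectors)  # normalize so inner product behaves like cosine similarity
    return vectors

# Illustrative chunk: full breadcrumb path prepended to the section text
chunks = ["GaLore > 5. Experiments :: We evaluate GaLore on pre-training and fine-tuning ..."]
index = faiss.IndexFlatIP(1536)   # inner-product index over 1536-dim vectors
index.add(embed_texts(chunks))
```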
Multimodal Proxy-Pointer Architecture
In my previous deep-dives, I shared evidence that Proxy-Pointer RAG could achieve 100% accuracy on financial 10-K documents by indexing “Strategic Pointers” (breadcrumbs like `Financials > Item 1A > Risk Factors`) instead of raw chunks.
For multimodal output, we modify the pipeline steps with the following premise: images (figures, tables, formulae, video clips, etc.) can be extracted as artifact files (.jpg, .png, .svg, .mp4, etc.) and stored alongside the document content. This is quite simple if the source document is a webpage or XML. For PDFs, although not perfect, an extractor such as the Adobe PDF Extract API (used here) can extract the tables and figures as artifacts.
In the extracted document itself, which in our case is Markdown, every figure is present as a relative path within the text, pointing to the actual image file. Following is an illustration:
Furthermore, inspired by the Tangram puzzle which forms different objects using a set of basic elements, as illustrated in Fig. 2(b), we reform the synthesis task as a rearrangement of a set of strokes extracted from the reference image.

[figure: "The Starry Night"]

[figure: "Self-Portrait"]

Which brings us to the following key insight that Proxy-Pointer uses. In practice, the LLM does not need to see the image itself to determine relevance. Instead, it only needs to know that an image exists within a specific section of the document. Since Proxy-Pointer retrieval brings in complete sections—rather than fragmented chunks—the LLM can rely on the section’s full context to judge relevance. This turns image selection into a conditional decision based on the section’s meaning and the user query, rather than an open-ended search problem based on multimodal similarity match.
This is exactly how humans read. We don't jump to view every table and figure mentioned; we first use the section context and our query to decide which ones are worth looking at.
Here is the indexing pipeline:
Skeleton Tree: As before, we parse the Markdown headings into a hierarchical tree with pure Python. Only now, a figures array is nested within each node, recording every figure found within that node (section) along with its path. The path is used to retrieve the image file for display. The rest of the fields are self-explanatory:
{
  "title": "1 Introduction",
  "node_id": "0003",
  "line_num": 17,
  "figures": [
    { "fig_id": "fig_1", "filename": "figures/fileoutpart0.png" },
    { "fig_id": "fig_2", "filename": "tables/fileoutpart1.png" }
  ]
},
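A simplified sketch of the tree builder, assuming the extractor emits standard Markdown image syntax for figure paths; the repo's builder handles nesting and more edge cases, and this flat node list is just to show the idea:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)")
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")  # Markdown image reference -> captures the relative path

def build_skeleton_nodes(md_lines: list[str]) -> list[dict]:
    """Collect one node per heading; attach every figure path found in that section."""
    nodes, current = [], None
    for line_num, line in enumerate(md_lines, start=1):
        heading = HEADING.match(line)
        if heading:
            current = {
                "title": heading.group(2).strip(),
                "node_id": f"{len(nodes) + 1:04d}",
                "line_num": line_num,
                "level": len(heading.group(1)),
                "figures": [],
            }
            nodes.append(current)
        elif current is not None:
            for path in IMAGE.findall(line):
                current["figures"].append(
                    {"fig_id": f"fig_{len(current['figures']) + 1}", "filename": path}
                )
    return nodes
```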
The next 4 steps essentially remain the same as before (a small sketch of the first two follows the list):
Breadcrumb Injection: Prepend the full structural path (Galore > 3. Methodology > 3.1. Zero Convolution) to every chunk before embedding.
Structure-Guided Chunking: Split text within section boundaries, never across them.
Noise Filtering: Remove distracting sections (TOC, glossary, executive summaries, references) from the index using an LLM.
Pointer-Based Context: Use retrieved chunks as pointers to load the full, unbroken document section (which now contains image paths within the text) for the synthesizer.
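To make the first two steps concrete, here is a minimal sketch of breadcrumb injection and structure-guided chunking; the paragraph-based splitting and the 1,200-character budget are illustrative choices, not the repo's exact parameters:

```python
def breadcrumb(nodes: list[dict], node: dict) -> str:
    """Walk back up the heading levels to rebuild the full structural path."""
    path, level = [node["title"]], node["level"]
    for prev in reversed(nodes[: nodes.index(node)]):
        if prev["level"] < level:          # nearest ancestor at a shallower heading level
            path.append(prev["title"])
            level = prev["level"]
    return " > ".join(reversed(path))

def chunk_section(doc_title: str, crumb: str, section_text: str, max_chars: int = 1200) -> list[str]:
    """Split one section into chunks; chunks never cross the section boundary."""
    prefix = f"{doc_title} > {crumb}\n"
    chunks, buf = [], ""
    for para in section_text.split("\n\n"):
        if buf and len(buf) + len(para) > max_chars:
            chunks.append(prefix + buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        chunks.append(prefix + buf.strip())
    return chunks
```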
The updated retrieval pipeline for multimodal retrieval is as follows:
Stage 1 (Broad Recall): FAISS returns the top 200 chunks by embedding similarity. These are deduplicated by `(doc_id, node_id)` to ensure we are looking at unique document sections, resulting in a shortlist of the top 50 candidate nodes. This step remains the same as before.
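Stage 1 is essentially a vector search followed by a group-by. A sketch, where chunk_meta is an assumed list mapping FAISS row ids to each chunk's doc_id, node_id and breadcrumb:

```python
def shortlist_sections(index, query_vec, chunk_meta, top_k=200, max_nodes=50):
    """Broad recall: take the top FAISS hits, then deduplicate by (doc_id, node_id)
    so each candidate represents a unique document section."""
    scores, ids = index.search(query_vec, top_k)      # query_vec: shape (1, dim), float32
    seen, candidates = set(), []
    for score, row in zip(scores[0], ids[0]):
        meta = chunk_meta[row]
        key = (meta["doc_id"], meta["node_id"])
        if key in seen:
            continue
        seen.add(key)
        candidates.append({**meta, "score": float(score)})
        if len(candidates) == max_nodes:
            break
    return candidates
```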
Stage 2 (Anchor-Aware Structural Re-Ranking): The re-ranker now receives the full breadcrumb path (as before) plus a semantic snippet (150 characters) for each of the 50 candidates. This was introduced because, unlike financial 10-Ks or technical manuals, academic papers often use generic, non-descriptive headings (such as '3. Experiments', '4. Optimization' or '5. Comparison'). The LLM needs a tiny 'semantic hint' to accurately pinpoint which of those vague sections actually contains the precision and similarity scores the user is asking for.
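The re-ranker's input is simply a numbered list of "breadcrumb :: snippet" lines. A sketch of how that prompt could be assembled (the field names are illustrative, not the repo's exact schema):

```python
def build_rerank_prompt(query: str, candidates: list[dict], snippet_chars: int = 150) -> str:
    """Present each candidate as its breadcrumb plus a short semantic hint,
    and ask the LLM to return the most relevant section ids."""
    lines = []
    for i, cand in enumerate(candidates):
        snippet = cand["section_text"][:snippet_chars].replace("\n", " ")
        lines.append(f"[{i}] {cand['breadcrumb']} :: {snippet}")
    return (
        f"User query: {query}\n\n"
        "Candidate sections:\n" + "\n".join(lines) + "\n\n"
        "Return the ids of the 5 sections most likely to answer the query, "
        "as a JSON list of integers."
    )
```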
Stage 3 (Synthesis and Context-Aware Image Selection): The Synthesizer LLM reviews the final k=5 sections and composes the text response. In addition, it makes the call on which of the images found within those sections should be displayed: it scans the sections for image paths and selects up to 6 images that seem most relevant to the query. The synthesizer also generates an accurate display label for each image, even when the author gave the table or figure no explicit caption.
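A sketch of what the Stage 3 call might look like with structured JSON output; the prompt wording and response schema are illustrative, while the model id is the one used in this prototype:

```python
import json
from google import genai
from google.genai import types

client = genai.Client()

def synthesize(query: str, sections: list[str], model: str = "gemini-3.1-flash-lite-preview") -> dict:
    """Ask the synthesizer for the answer plus at most 6 supporting images with labels."""
    prompt = (
        "Answer the user query using ONLY the sections below. Each section may contain "
        "image paths; select at most 6 images that directly support the answer and give "
        "each a short display label (even if the figure has no caption).\n\n"
        f"Query: {query}\n\n" + "\n\n---\n\n".join(sections) + "\n\n"
        'Respond as JSON: {"answer": str, "images": [{"path": str, "label": str}]}'
    )
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return json.loads(response.text)
```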
This pipeline achieves 95% accuracy on image retrieval on the 20-question benchmark I created, as judged by Claude. I have shared a few of the results in the next section; the full results are available in the repo. If you wish to refine the results further, the next step is an optional vision filter.
Stage 4 (Vision Filter — optional): For further refinement of the selected images, an optional vision selection step can be turned on in config.py. Here the LLM is asked to actually look at the 6 images using its vision capabilities, consider the user query and text response, and drop any images that do not seem relevant. This yields precise, curated images in the response but adds a few seconds of latency. It was not used for the benchmark results.
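A sketch of what this optional vision pass might look like, again with the google-genai SDK; the prompt, the return format and the PNG assumption are illustrative:

```python
from google import genai
from google.genai import types

client = genai.Client()

def vision_filter(query: str, answer: str, image_paths: list[str],
                  model: str = "gemini-3.1-flash-lite-preview") -> str:
    """Let the model actually look at the candidate images and drop irrelevant ones."""
    parts = [types.Part.from_text(text=(
        "Below are candidate images for the answer. Keep only those relevant to the "
        f"query and answer.\nQuery: {query}\nAnswer: {answer}\n"
        "Return a JSON list of the indices (0-based) of images to keep."
    ))]
    for path in image_paths:                         # the (at most 6) images from Stage 3
        with open(path, "rb") as f:
            parts.append(types.Part.from_bytes(data=f.read(), mime_type="image/png"))
    response = client.models.generate_content(model=model, contents=parts)
    return response.text                             # e.g. "[0, 2, 3]"
```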
Finally, a simple Streamlit UI is created to visualise the outputs.
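The display layer itself can stay tiny. A sketch of the Streamlit page, where answer_query is an assumed wrapper around the retrieval pipeline above rather than the repo's exact function:

```python
import streamlit as st

st.title("Proxy-Pointer Multimodal RAG")
query = st.text_input("Ask a question about the indexed papers")

if query:
    result = answer_query(query)        # assumed: returns answer text, sources, curated images
    st.markdown(result["answer"])
    st.caption("Sources: " + " | ".join(result["sources"]))
    for img in result["images"]:        # curated list from Stage 3 (and optionally Stage 4)
        st.image(img["path"], caption=img["label"])
```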
Results
I tried the bot on a 20-question set, ranging from precise retrieval to cross-document reasoning and structural reasoning. Full results are available in my GitHub repo. The results: 17 perfect retrievals, 1 with no image retrieved, and 2 partial retrievals. The most important observation, however, is that in no case did an incorrect image from a document unrelated to the query leak into the results and erode trust in the bot. This is possible due to the surgical accuracy of Proxy-Pointer. Where image retrievals fall short, the causes are LLM variance and the child-node issue that I discuss in the next section.
Following is a snapshot of a few results. For brevity, the text responses are truncated here:
Precise Data Retrieval Question: What are the hyperparameters used for fine-tuning RoBERTa-Base for GaLore across different GLUE tasks?
The hyperparameters used for fine-tuning RoBERTa-Base for GaLore across GLUE tasks include a training duration of 30 epochs, a maximum sequence length of 512, and a batch size of 32
..contd
Sources:
• Galore > D. Fine-Tuning Experiments > D.1. Details of Fine-Tuning on GLUE
• Galore > Epochs
• Galore > 5. Experiments
AI-Verified Evidence List:
• Galore - Table 7: Hyperparameters of fine-tuning RoBERTa base for GaLore (Galore/tables/fileoutpart157.png)
Cross Document Reasoning Question: Both GaLore and CLIP-CITE address the challenge of efficient model training. Compare GaLore’s gradient projection approach with CLIP-CITE’s fine-tuning strategy in terms of memory efficiency and knowledge preservation.
Memory Efficiency
GaLore: Focuses on reducing the memory footprint of optimizer states (e.g., Adam) by projecting gradients into a low-rank subspace. By tracking only low-rank gradient statistics, it avoids storing full-rank matrices. It is more memory-efficient than LoRA
..contd
Sources:
• Galore > 4. GaLore for Memory-Efficient Training > 4.2. Memory-Efficient Optimization
• Galore > 5. Experiments > 5.4. Memory-Efficient Fine-Tuning
• CLIP > 4 Experiments > 4.3 Further Analysis
AI-Verified Evidence List:
• Galore - Table 1: Comparison between GaLore and LoRA (Galore/tables/fileoutpart39.png)
• CLIP - Table 5: Comparison performances and training efficiency (CLIP/tables/fileoutpart18.png)

Visual Query: Describe the VectorFusion pipeline stages: how is the raster image first sampled, then converted to SVG, and finally fine-tuned using latent score distillation?
The VectorFusion pipeline generates SVGs through three primary stages:
Raster Sampling: A raster image is sampled from a pretrained text-to-image diffusion model (e.g., Stable Diffusion) using a Runge-Kutta solver. To ensure iconographic aesthetics, the input text is appended with a prompt suffix
..contd
Sources:
• VectorFusion > 4. Method: VectorFusion
• VectorFusion > 4. Method: VectorFusion > 4.1. A baseline: text-to-image-to-vector
• VectorFusion > 4. Method: VectorFusion > 4.2. Sampling vector graphics by optimization
AI-Verified Evidence List:
• VectorFusion - Figure 3: VectorFusion generates SVGs in three stages. (VectorFusion/figures/fileoutpart29.png)
• VectorFusion - Figure 5: Overview of VectorFusion’s latent score distillation optimization procedure. (VectorFusion/figures/fileoutpart33.png)

Visual Question: What games are implemented in Nemobot, and how are they categorized according to Shannon’s game taxonomy?
Nemobot implements a variety of games categorized according to Shannon’s game taxonomy, which classifies machines based on their decision-making logic:
Category 1: Dictionary-Based Machines
Tic Tac Toe: Uses a dictionary-based system where the LLM agent dynamically generates optimal moves for game states, effectively compressing the state-action dictionary into its inference capabilities.
..contd
Sources:
• NemoBot > II. GAME PLAYING WITH LARGE LANGUAGE MODELS > A. Shannon’s Taxonomy of AI Games with LLMs
• NemoBot > IV. EVALUATION OF NEMOBOT GAMES > A. Dictionary-Based Game: Tic Tac Toe
• NemoBot > I. INTRODUCTION
AI-Verified Evidence List:
• NemoBot - Table I: Nemobot games categorized according to Shannon’s game taxonomy (NemoBot/tables/fileoutpart5.png)

Edge Cases & Design Trade-offs
LLM non-determinism
Since the image selection is performed by the LLM, even with temperature = 0.0, repeated runs of the same query may surface slightly different images. Depending on preference, we may find some of them more relevant than others.
Child-node figures
Specific queries (e.g., What is Vision Distillation Loss?) are likely to land on the sections containing the particular concept and display the precise formula and figures much better than broad ones (e.g., Compare the VectorFusion pipeline with VectorPainter). Broad queries tend to retrieve header-level nodes (sections), while the associated figures may reside within child nodes that do not fit within the k=5 context window. However, asking about either pipeline individually works fine, since all 5 slots go to one paper, bringing adequate child nodes — and therefore the relevant figures — into context.
Detached image paths
This approach assumes the image path (e.g., ``) physically exists within the retrieved section. If a figure is referenced in text but stored in a separate section (say, an Appendix) that isn’t retrieved, it won’t be surfaced. A practical workaround is to name image files in a way that can be derived — `table_1.jpg`, `figure_3.png` — so the synthesizer can construct the path from the reference, rather than relying on generic extractor names like `fileoutpart1.png`. Regardless of approach, the core principle holds: no multimodal embedding or visual interpretation is needed. Full section context is sufficient for the LLM to make intelligent image selections.
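A minimal sketch of that workaround, assuming artifacts are saved under a predictable (hypothetical) naming scheme so a path can be derived directly from the in-text reference:

```python
import re

def derive_image_path(reference: str, doc_dir: str) -> str | None:
    """Map an in-text reference like 'Table 1' or 'Figure 3' to a predictable
    artifact path, even if the path string is absent from the retrieved section."""
    m = re.match(r"(table|figure)\s+(\d+)", reference.strip(), re.IGNORECASE)
    if not m:
        return None
    kind, number = m.group(1).lower(), m.group(2)
    return f"{doc_dir}/{kind}s/{kind}_{number}.png"   # e.g. "VectorFusion/figures/figure_3.png"
```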
Open-Source Repository
Proxy-Pointer is fully open-source (MIT License) and can be accessed at the Proxy-Pointer GitHub repository. The multimodal pipeline is being added to the same repo alongside the existing text-only version.
It is designed for a 5-minute quickstart:
MultiModal/
├── src/
│   ├── config.py                # Model selection (Gemini 3.1 Flash Lite)
│   ├── agent/
│   │   └── mm_rag_bot.py        # MultiModal RAG logic
│   ├── indexing/
│   │   ├── md_tree_builder.py   # Structure tree generator
│   │   └── build_md_index.py    # Vector index builder
│   └── extraction/
│       └── extract_pdf.py       # Adobe PDF extraction-to-MD logic
├── data/                        # Unified data hub
│   ├── extracted_papers/        # Processed Markdown & figures
│   └── pdf/                     # Original source PDFs
├── results/                     # Benchmarking hub
│   ├── test_log.json            # 20-query results & metrics
│   └── test_queries.json        # Benchmark questions
├── app.py                       # Streamlit multimodal UI
└── run_test_suite.py            # Automated benchmark runner
Key Takeaways
- Multimodal RAG is not primarily a vision problem; it is a retrieval alignment problem. The challenge is not extracting or embedding images, but confidently associating them with the right semantic context.
- Chunk-based retrieval breaks visual coherence. Sliding-window chunking fragments captions and disconnects images from their true semantic units, making reliable selection difficult.
- Multimodal embeddings introduce ambiguity, not clarity. Visually similar artifacts (e.g., tables, diagrams) sit close together in the same vector space, making it hard to judge relevance without structural grounding.
- Structure is the missing layer. Treating documents as hierarchical semantic units allows images to inherit meaning from their section, enabling confident selection.
- Proxy-Pointer reframes the problem. Instead of searching for images directly, it retrieves sections and selects images conditionally based on full context, turning a hard retrieval problem into a simpler filtering task.
- Accuracy matters more for visuals than for text. Showing an incorrect image can be more damaging than omitting one entirely, making precision critical for enterprise use cases.
Conclusion
Multimodal responses have long been seen as the next step in the evolution of RAG systems. Yet, despite advances in vision models and multimodal embeddings, reliably returning relevant images alongside text remains an unsolved problem.
The reason is subtle but fundamental: traditional RAG pipelines operate on fragmented chunks, while meaning—especially visual meaning—lives at the level of full document structure. Without aligning retrieval to semantic units, even the most advanced models struggle to make the correct visual associations.
Proxy-Pointer MultiModal RAG addresses this gap by upgrading the foundation from flat chunks to structured context. By retrieving complete sections and treating image paths as pointers to artifacts within them, it enables accurate, scalable, and cost-efficient multimodal responses—without relying on expensive multimodal embeddings.
The result is a practical step forward: chatbots that don't just narrate, but show precise evidence, always grounded in the right context.
Clone the repo. Try your own documents. Let me know your thoughts.
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All research papers used in this article are available at CLIP, Nemobot, GaLore, VectorFusion and VectorPainter under a CC-BY license. Code and benchmark results are open-source under the MIT License. Images used in this article were generated using Google Gemini.



