Making a PDF’s Images Searchable for RAG, Without Paying to Read Them All

This article is a companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: image_df, which locates every picture in the PDF without reading any of them. This part builds the reading toolbox: a cost-ordered cascade (a cheap filter, a type check, classic OCR, a vision model) that turns the few images worth paying for into searchable text.

*where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), reading the images the parser only located – Image by author*

The parsing brick gives you image_df: one row per image in the PDF, with its page, its bounding box, its size, a content hash. That locates every picture. It does not say what any of them shows. For retrieval, that is the same as not having them: a bounding box is not something a user can search, and the image’s text slot, the place a description would live, is empty.

The reflex is to throw a vision model at every image and be done. That is the wrong default. A real document is full of images that carry nothing a reader would ever search for: the company logo in every page header, a horizontal rule drawn as a 2-pixel-tall picture, a bullet glyph, a decorative banner. Captioning those with a vision LLM is paying a model to describe a logo three hundred times.

So the job splits in two. First, the methods that turn an image into text, and what each one costs: a cheap filter, a type check, classic OCR, a vision model. Second, which images are actually worth spending on in a given run. That second half is driven by context. A body line that reads “Figure 3 below shows…” is the cue to read that figure with a vision model, and not its neighbours; the question being asked narrows it further. This article lays down the methods and shows what each returns, ordered by cost. Choosing which images to pay for, per document and per query, is adaptive parsing, and it has its own article (Article 10). Here we build the toolbox.

*one extracted image in, a searchable description out, paying the cheapest method that can read it – Image by author*

1. Most images are not worth a model call

The first step spends nothing. Before any OCR or vision call, a cheap filter looks at signals already in image_df plus a couple of pixel statistics, and drops the images with no retrieval value:

Too small. An image whose shortest side is a few dozen pixels, or whose total area is below a small floor, is an icon or a bullet, not a figure. A size threshold removes most of them.
The wrong shape. A picture that is very long and very thin is a rule or a divider, not content. An aspect-ratio guard catches those.
Repeated everywhere. The same content hash on most pages of the document is chrome: a header logo, a footer mark, a watermark. Counting how many pages an image hash appears on flags it as decoration, not information.

is_worth_analyzing applies the size and shape rules per image. flag_worth_analyzing first counts how many pages each content hash appears on, then adds a worth_analyzing column to image_df. Both live in docintel.parsing.pdf.images. The thresholds are deliberately loose: a false keep costs one model call later, while a false drop loses content with no trace, so when in doubt the filter keeps the image. Flat, contentless images that are too large to fail the size test (a solid colour panel, say) are not this filter’s job; they are caught one step later as decorative and skipped just the same.

In: image_df (+ per-image pixel stats). Out: the same table with a worth_analyzing flag.

On a typical report, this alone removes the large majority of images before a single model runs. What’s left is the handful that actually carry meaning.
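The three rules can be sketched in a few lines of pandas. This is a minimal illustration, not the code behind is_worth_analyzing or flag_worth_analyzing: the function names, the threshold values, and the width, height, and content_hash column names are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical thresholds, kept loose on purpose: a false keep costs one
# later call, a false drop loses content silently.
MIN_SIDE_PX = 32        # shortest side below this: icon or bullet
MIN_AREA_PX = 4096      # total area below this: too small to be a figure
MAX_ASPECT = 20.0       # longer side / shorter side above this: rule or divider
MAX_PAGE_SHARE = 0.5    # hash seen on more than this share of pages: chrome


def passes_size_and_shape(width: int, height: int) -> bool:
    """Size and shape rules for one image (no pixel access needed)."""
    short, long_ = min(width, height), max(width, height)
    if short < MIN_SIDE_PX or width * height < MIN_AREA_PX:
        return False
    return long_ / short <= MAX_ASPECT


def flag_images(image_df: pd.DataFrame, n_pages: int) -> pd.DataFrame:
    """Return a copy of image_df with a boolean worth_analyzing column."""
    out = image_df.copy()
    # Number of distinct pages each content hash appears on.
    pages_per_hash = out.groupby("content_hash")["page_num"].transform("nunique")
    repeated = pages_per_hash / n_pages > MAX_PAGE_SHARE
    sized = [passes_size_and_shape(w, h) for w, h in zip(out["width"], out["height"])]
    out["worth_analyzing"] = pd.Series(sized, index=out.index) & ~repeated
    return out


# A four-page toy document: a logo on every page, a thin rule, one real figure.
demo = pd.DataFrame({
    "page_num":     [1, 2, 3, 4, 2, 3],
    "content_hash": ["logo", "logo", "logo", "logo", "rule", "fig"],
    "width":        [120, 120, 120, 120, 600, 640],
    "height":       [60, 60, 60, 60, 2, 480],
})
flagged = flag_images(demo, n_pages=4)
print(flagged[["content_hash", "worth_analyzing"]])
```

In this toy frame the logo passes the size test and is dropped only because its hash recurs on every page, the rule is dropped on its 2-pixel side, and the single figure is the only row kept.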

2. What kind of image is it?

The images that survive the filter are not all read the same way. A screenshot of a table is text: classic OCR reads it cheaply and exactly. A line chart is not text at all; its meaning is in the axes and the trend, and only a vision model can put that into words. Sending the chart to OCR returns a few stray axis labels; sending the screenshot to a vision model pays chart prices for something OCR does for free.

So the second step classifies each kept image into one type:

decorative: a blank or near-uniform panel. Skip.
text: a screenshot, a scanned region, a table rendered as an image. Reads with OCR.
chart / diagram / photo: the meaning is visual. Reads with a vision model.

classify_image returns one ImageType from cheap pixel signals: how much the pixels vary, how saturated they are, how much of the image is near-white background, how dense its edges are. A near-uniform panel is decorative. The test there is worth dwelling on, because the obvious version is wrong: you cannot detect a blank panel by counting its colours. A real “all-black” or “all-white” region is never pixel-perfect; sensor noise and JPEG compression give it hundreds of near-identical colours, so a colour count sails right past it. What stays near zero on a blank panel, noise and all, is the dispersion of the pixel values, their standard deviation. Low dispersion means blank, whatever the colour count, so that is the signal. Black ink on a white page, near-zero saturation with real stroke structure, is text. A saturated, full-bleed image with no white margins is a photo. Everything else, every uncertain case, falls through to chart.

Notice what is not in that list: a step that decides “this looks like a logo”. That is on purpose, and it is the same lesson as the blank panel. A logo can be two flat colours, a black wordmark on white, or a full-colour gradient with soft edges. Counting colours catches the first and misses the second, and worse, the two-colour test also catches a bilevel scan of real text you wanted to read. Appearance does not tell you it is a logo. Behaviour does: a logo is chrome because it repeats, the same mark in every page header. That signal already ran, back in the filter, which drops an image whose content hash recurs across pages no matter how many colours it has. A logo that appears only once, a mark on a cover page, is not worth a special case; it gets read like anything else, a wordmark falling to free OCR, a graphic to a single vision call. The rule throughout is the same: skip only what you are sure is empty or chrome, and read everything else, because a wrong skip loses content silently.

That fall-through to chart is the other important design choice. Classifying a chart against a diagram against a photo on cheap signals alone is not reliable, so the classifier does not try to be clever: it only diverts an image to cheap OCR when it is confident the image is clean monochrome text, and sends everything else to the vision model, which reads charts, diagrams, photos, and any text they happen to contain. The bias is asymmetric on purpose. A missed OCR shortcut costs one vision call; OCR run on a diagram returns a handful of stray axis labels and nonsense. So when in doubt, the classifier pays for vision. Classification itself stays cheap, no model call, because it has to be cheaper than the analysis it is there to avoid.

In: an image that passed the filter. Out: its ImageType.
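The blank-panel argument is easy to check on synthetic pixels. The sketch below is an illustration under assumptions, not classify_image itself: the helper names and the max_std threshold are invented for the example, and the "noisy black" panel is simulated with random values rather than taken from a real scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "black" panel as it tends to arrive: dark, with small sensor/JPEG-like noise.
noisy_black = rng.integers(0, 12, size=(200, 200, 3), dtype=np.uint8)

# A crisp black-on-white image: only two colours, but a wide spread of values.
bilevel_text = np.full((200, 200, 3), 255, dtype=np.uint8)
bilevel_text[::8, :, :] = 0  # a dark stroke on every 8th row


def count_colours(img: np.ndarray) -> int:
    """Number of distinct RGB triples in the image."""
    return len(np.unique(img.reshape(-1, 3), axis=0))


def is_blank(img: np.ndarray, max_std: float = 8.0) -> bool:
    """Blank means low dispersion of pixel values; max_std is an assumed threshold."""
    return float(img.std()) < max_std


for name, img in [("noisy black panel", noisy_black), ("bilevel text", bilevel_text)]:
    print(name, "colours:", count_colours(img), "std:", round(float(img.std()), 1),
          "blank:", is_blank(img))
```

The noisy panel carries far more distinct colours than the two-colour text image, yet only its standard deviation stays low, which is why dispersion rather than colour count is the usable signal.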

3. The cascade: the cheapest method that can read it

Type decides method. METHOD_BY_TYPE maps each type to one of three actions, ordered by cost, and describe_figure dispatches on it. The whole decision, for the cases you actually meet in a document, fits in one table: what catches the image, what it costs, and what you get back.

*the cascade decision for every image kind you meet in a real document, from free to paid – Image by author*

Read it top to bottom and you read the cascade in order. The first three rows never reach a model at all: the filter throws them out on size, shape, or repetition. The next row is caught by the classifier as a blank panel and skipped too. Only the bottom five are read at all, and of those only the genuine text-image takes the free OCR path. The rest reach the vision model, which is exactly where you want your money going.

Watch out: sideways figures. A wide table or a landscape chart is often laid at 90 degrees to fit a portrait page. The turn rarely shows up where you would look first: the page’s rotation flag stays at 0, and the angle sits in the image’s own placement matrix instead. Rendered as-is, the figure reaches OCR or the vision model on its side, where OCR returns noise and a vision model reads it with misplaced confidence and no warning that it struggled. So the cascade reads the placement angle and counter-rotates the region before either method sees it: automatic, exact, no orientation-guessing. The one residual case is a scan with the turn baked into its pixels, with no matrix to read; there the OCR branch retries the quarter-turns and keeps the best-scoring one.
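As a rough sketch of the counter-rotation step: a PDF transformation matrix for a rotation by θ begins with (cos θ, sin θ), so the angle can be recovered with atan2 from its first two entries, even when the matrix also scales the image. The function names below are hypothetical, and the direction of the correction is an assumption: which way a positive angle turns on screen depends on the renderer's coordinate convention, so the sign may need flipping in a real pipeline.

```python
import math

import numpy as np


def placement_angle(a: float, b: float) -> float:
    """Rotation angle in degrees encoded by the first two matrix entries (a, b).

    For a rotation by theta the matrix starts with (s*cos theta, s*sin theta)
    for some positive scale s, so atan2(b, a) recovers theta.
    """
    return math.degrees(math.atan2(b, a)) % 360


def quarter_turns(angle_deg: float) -> int:
    """Snap an angle to the nearest multiple of 90 degrees, as 0..3 turns."""
    return round(angle_deg / 90) % 4


def counter_rotate(pixels: np.ndarray, turns: int) -> np.ndarray:
    """Undo `turns` quarter-turns.

    Assumes the figure was turned in the direction np.rot90 uses for
    positive k; flip the sign for the opposite convention.
    """
    return np.rot90(pixels, k=-turns)


# A figure placed with a 90-degree turn and a 300-unit scale: a = 0, b = 300.
turns = quarter_turns(placement_angle(0.0, 300.0))

upright = np.arange(6).reshape(2, 3)
sideways = np.rot90(upright, k=turns)      # what a naive render would produce
restored = counter_rotate(sideways, turns)
print(turns, np.array_equal(restored, upright))
```

Because the angle is read rather than guessed, the correction is exact for quarter-turns; the retry over the four orientations is only needed when there is no matrix to read.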

3.1. Skip: pay nothing for the noise

decorative: no call. A blank or near-uniform panel keeps its empty text slot. Together with the images the filter already dropped (the too-small, the wrong-shaped, the repeated chrome), this is where most of a clean document’s images end up, which is the point.

3.2. Classic OCR for text-images

text: a screenshot, a scanned table, a figure that is really rendered text. Classic OCR reads it locally, in milliseconds, for free. The series uses EasyOCR (docintel.parsing.pdf.easyocr); Tesseract is the other common choice. On clean printed text OCR is highly accurate, and unlike a generative model it transcribes rather than invents, which is exactly what you want when the image is text. Its companion article (Article 5 quinquies) covers OCR as a parser back-end in full; here it is one branch of the cascade.

The catch is handwriting. A handwritten note looks like text to the classifier, but classic OCR is trained on print and reads cursive as a string of guesses. The fix is to let OCR report how sure it is. EasyOCR returns a confidence score with every line, so describe_figure reads the text and its mean confidence: a confident read is returned as is, a low-confidence read is treated as a failed attempt and the image falls through to the vision model, which handles handwriting far better. The same path covers the rarer case where the classifier mistyped a non-text image as text. So the OCR branch is not “trust OCR blindly”; it is “try the free reader, keep its answer only when it is sure, otherwise pay for vision”.
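The confidence gate can be sketched on its own, without running OCR. The example assumes the list of (bounding box, text, confidence) tuples that EasyOCR's readtext returns by default; the accept_ocr name and the 0.6 cut-off are hypothetical, and this is not the logic inside describe_figure.

```python
from statistics import mean
from typing import Optional

# Assumed cut-off; tune it on your own documents.
MIN_MEAN_CONFIDENCE = 0.6


def accept_ocr(results: list, min_conf: float = MIN_MEAN_CONFIDENCE) -> Optional[str]:
    """Return the OCR text if the read is confident, else None.

    `results` is a list of (bbox, text, confidence) tuples.
    """
    if not results:
        return None  # nothing read: let the caller escalate
    if mean(conf for _, _, conf in results) < min_conf:
        return None  # low confidence: treat as a failed attempt
    return "\n".join(text for _, text, _ in results)


# Made-up results for illustration; bounding boxes are omitted.
printed = [(None, "Revenue", 0.98), (None, "Q3 2024", 0.95)]
cursive = [(None, "rnee+", 0.21), (None, "tlnq", 0.34)]

print(accept_ocr(printed))   # confident read, kept as is
print(accept_ocr(cursive))   # None: the caller falls through to vision
```

A None here is the signal to escalate, so the caller only pays for a vision call when the free reader declined to answer.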

3.3. Vision LLM for charts, diagrams, and photos

chart, diagram, photo: the only images where the meaning is genuinely visual. A vision model looks at the picture and writes a short description, “a line chart of commodity prices since 2022, rising then flat after Q3”, “the Transformer architecture, an encoder of N stacked layers feeding a decoder”. That sentence is text, so retrieval can finally match it. This is the one thing no textual parser can do, and it is the costliest step, so the whole cascade exists to make sure only these images reach it. The vision call itself goes through docintel.core.analyze_image, the one place every model call in the series lives (alongside llm_parse and llm_chat); the cost it carries is the subject of Article 5 quater (vision reading).

The classifier already knows the type, so the prompt is tuned to it instead of one generic “describe this image”. A chart is asked for its axes, units, and trend; a diagram for its components and how they connect, with every label transcribed; a table rendered as an image is asked for its rows back as markdown; a photo for what it shows. The right question pulls the right answer: ask a chart for its trend and you get the trend, ask it to “describe the image” and you get a sentence about colours. A caller can still pass one explicit prompt to override the type-specific ones, which is how a project-scoped or user-edited instruction flows through.

In: a typed image. Out: a short description, or None for a skip.
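The dispatch and the type-specific prompts can be pictured as two lookup tables. This is a sketch under assumptions rather than the series' METHOD_BY_TYPE or describe_figure: the Kind enum stands in for ImageType, and the table names, the plan function, and the prompt wording are all invented for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class Kind(Enum):  # stand-in for the series' ImageType
    DECORATIVE = "decorative"
    TEXT = "text"
    CHART = "chart"
    DIAGRAM = "diagram"
    PHOTO = "photo"


# Three actions, ordered by cost: skip < ocr < vision.
METHOD_FOR = {
    Kind.DECORATIVE: "skip",
    Kind.TEXT: "ocr",
    Kind.CHART: "vision",
    Kind.DIAGRAM: "vision",
    Kind.PHOTO: "vision",
}

# Illustrative prompt wording, not the series' actual prompts.
PROMPT_FOR = {
    Kind.CHART: "State the chart type, axes, units, and the overall trend.",
    Kind.DIAGRAM: "List the components and how they connect; transcribe every label.",
    Kind.PHOTO: "Describe what the photo shows in one or two sentences.",
}


def plan(kind: Kind, override: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (method, prompt); prompt is None unless a vision call is planned."""
    method = METHOD_FOR[kind]
    if method != "vision":
        return method, None
    # An explicit caller prompt wins over the type-specific one.
    return method, override or PROMPT_FOR[kind]


print(plan(Kind.TEXT))
print(plan(Kind.CHART))
print(plan(Kind.DIAGRAM, override="Transcribe only the box labels."))
```

Keeping the prompt next to the method makes it easy to record which instruction produced a description, and an override replaces the default without touching the tables.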

4. Writing the description back

The description is only useful if retrieval can find it. The image already has a line slot in line_df: an image sits at a position on the page, so it occupies a line with an empty text cell, as covered in Article 5B (the relational data model). The cascade writes its description into that cell. describe_image_df adds a description column to image_df, and the caller joins it back onto the image’s line.

The effect is that “the architecture diagram” or “the revenue chart” now retrieves the right page, through the same keyword and embedding path as any other line. Nothing downstream needs to know the text came from a picture.
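The join back onto the line slot might look like the sketch below. The schema is assumed for the example: an image_id key shared by both tables and a text cell in line_df are hypothetical column names, and the two small frames are toy data, not output from the series' parser.

```python
import pandas as pd

# Assumed columns: both tables share an image_id key; line_df has a text cell.
line_df = pd.DataFrame({
    "line_id":  [10, 11, 12],
    "image_id": [None, "img-3", None],
    "text":     ["Figure 3 below shows the gains.", "", "The gains hold across sizes."],
})
image_df = pd.DataFrame({
    "image_id":    ["img-3"],
    "description": ["A line chart of commodity prices since 2022, rising then flat."],
})

merged = line_df.merge(image_df, on="image_id", how="left")
# Fill only image lines whose text cell is still empty.
fill = merged["description"].notna() & (merged["text"] == "")
merged.loc[fill, "text"] = merged.loc[fill, "description"]
line_df = merged.drop(columns="description")

print(line_df)
```

The row count and the columns are unchanged after the write-back: the image's line simply stops being empty, and ordinary text lines are left alone.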

The enrichment is incremental by design. You can run the cascade at parse time for a small corpus, or lazily, only on the images a given run actually needs. The text slot is empty until something fills it, and filling it never changes the contract: it is still one row, one line, one text value. When to fill it is the open question this article leaves for adaptive parsing (Article 10): rather than read every figure up front, the cheap text is read first, and a cross-reference in that text (“Figure 3 below shows the gains”) is what triggers a vision call on the figure it points to. The methods here are what that policy will call; the policy itself is the next article.

The whole cascade ships as one call. Hand it the image_df from parse_pdf and the pdf_path it was parsed from, and read back the same frame with the four new columns the cascade fills.

parsed = parse_pdf("data/paper/1706.03762v7.pdf")    # image_df locates the pictures
enriched = describe_image_df(parsed["image_df"], pdf_path="data/paper/1706.03762v7.pdf")

# describe_image_df adds four columns to image_df:
enriched[["page_num", "worth_analyzing", "image_type", "description", "prompt"]].head()
# worth_analyzing : the cheap filter's verdict       (True/False)
# image_type      : "decorative" | "text" | "chart" | "diagram" | "photo" | None
# description     : the searchable text written into the image's line slot
# prompt          : the instruction sent to the vision model (None for OCR / skip)

This is also the part of the cascade a user can see and correct. The screenshot below is a desktop document app running the same pipeline on NIST AI 100-1 (the AI Risk Management Framework, a US Government work, public domain): the Images tab lists every figure the parser located, the selected diagram carries the description gpt-4.1 wrote for it, and the description stays editable. Per-image controls re-run OCR or force the vision model when the cheap path got it wrong.

*the cascade surfaced to the user: every located figure, its description written into the document model, and the per-image controls to re-run OCR or force vision – Image by author*

5. Cost and latency: pay per image, not per page

The cascade’s whole purpose is to make the cost track the value. The cheap filter runs on every image and the classifier on every kept one, and both cost effectively nothing. OCR is local and free. The vision model, the one line item that actually costs money and seconds, runs only on charts, diagrams, photos, and the occasional text-image that OCR could not read with confidence, which on most enterprise documents are a small fraction of the images and a tiny fraction of the pages.

The alternative, captioning every image with a vision model, costs the same per image whether it is a logo or a chart, and most images are logos. The cascade replaces a flat per-image vision bill with a filter, a cheap classifier, and a vision call only where nothing else can read the picture. On a report with one logo per page and two real figures, that is two vision calls instead of dozens.

The same image is also never paid for twice. The filter already drops chrome that recurs on most pages, but a real figure can still appear on a handful of pages (a reference diagram, a repeated exhibit). The cascade keys on the content hash, so a figure that shows up on ten pages is read once and the description is reused for the other nine. One image, one model call, however many times it appears.
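Keying on the content hash amounts to a small cache in front of the model call. The sketch below assumes nothing about the series' implementation: describe_once_per_hash and fake_vision are hypothetical names, and the fake function stands in for a paid call so the number of invocations can be counted.

```python
from typing import Callable, Dict, List, Optional


def describe_once_per_hash(
    hashes: List[str],
    describe: Callable[[str], Optional[str]],
) -> List[Optional[str]]:
    """Call `describe` once per distinct hash and reuse the result."""
    cache: Dict[str, Optional[str]] = {}
    out: List[Optional[str]] = []
    for h in hashes:
        if h not in cache:
            cache[h] = describe(h)  # the only place a model call would happen
        out.append(cache[h])
    return out


calls: List[str] = []


def fake_vision(content_hash: str) -> str:
    """Stand-in for a paid model call; records each invocation."""
    calls.append(content_hash)
    return f"description of {content_hash}"


# The same figure on ten pages, plus one other figure.
hashes = ["fig-a"] * 10 + ["fig-b"]
descriptions = describe_once_per_hash(hashes, fake_vision)

print(len(descriptions), "descriptions from", len(calls), "calls")
```

Caching a None result as well means an image that was skipped once is not re-examined on every later page it appears on.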

6. Conclusion

image_df locates every picture; it does not read any of them. Reading them is a separate brick, and this article lays down its methods, ordered by cost: drop the noise for free, classify what’s left cheaply, read clean text with OCR, and keep the vision model for the charts and diagrams where the meaning is genuinely visual. Each method leaves its result in the image’s text slot, and from there an image is just another searchable line. What this article deliberately does not settle is which images to run in a given pass: reading every figure up front is rarely what you want, and the context-driven choice, letting the surrounding text and the question decide, is adaptive parsing (Article 10). The toolbox first; the policy next.

Sources and further reading

Article 5 (parsing) and Article 5B (the relational tables) introduce image_df and the line slot the description is written back into.
Article 5 quater (vision reading) covers the vision-LLM back-end and its cost.
Article 5 quinquies (EasyOCR) covers classic OCR as a parser back-end.
Article 10 (adaptive parsing) is where the choice this article defers gets made: which images to read in a given run, escalating from cheap text to a vision call only where the context asks for it.

Earlier in the series:

Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it is worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.


