Bringing visual language intelligence to RAG with ColPali

If you have ever tried to build a RAG (Retrieval-Augmented Generation) application, you may be familiar with the challenges posed by tables and images. This article explores how to deal with these content types using visual language models, specifically the ColPali model.
But first, what exactly is RAG – and why do tables and images make it so difficult?
RAG and its messy preprocessing
Imagine you are faced with a question like:
What is our company's cash handling policy?
A basic LLM (large language model) probably won't be able to answer this, since such details are company-specific and usually not included in the model's training data.
That's why a common approach is to connect the LLM to a knowledge base – like a SharePoint folder containing various internal documents. This allows the model to retrieve and insert relevant context, enabling it to answer questions that require specialized information. This method is known as Retrieval-Augmented Generation (RAG), and it often involves working with documents such as PDFs.
However, extracting relevant information from a large and diverse knowledge base requires extensive document preprocessing. Common steps include (a minimal code sketch follows the list):
- Parsing: Convert documents into text and images, often with the help of OCR tools like Tesseract. Tables are usually converted to text
- Formatting: Preserve the structure of the text, including headings and paragraphs, by converting the extracted text into a structure-preserving format such as Markdown
- Chunking: Split or merge text passages so that each chunk fits into the context window while its surrounding context is maintained, without producing fragmented passages
- Enrichment: Add metadata to the chunks, e.g. extracted keywords or summaries, to aid retrieval. Optionally, caption images with descriptive text using a multimodal LLM to make them searchable
- Embedding: Embed the texts (and possibly the images, using a multimodal embedding model), and store them in a vector DB
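
To make the amount of plumbing concrete, here is a minimal sketch of the parsing and chunking steps, assuming the pypdf library; the file name, chunk sizes, and the commented-out embed_text/vector_db helpers are illustrative placeholders, not part of any specific framework:

```python
# A minimal sketch of a traditional RAG ingestion pipeline (illustrative only).
from pypdf import PdfReader

def extract_pages(path: str) -> list[str]:
    """Parsing: extract raw text from every page of a PDF."""
    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Chunking: naive fixed-size windows with overlap. Real pipelines also
    try to respect headings, paragraphs, and sentence boundaries."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

pages = extract_pages("policy.pdf")                  # hypothetical input file
chunks = [c for page in pages for c in chunk(page)]
# Embedding and storage would follow, e.g.:
#   vectors = [embed_text(c) for c in chunks]   # embed_text: any text embedder
#   vector_db.add(chunks, vectors)              # vector_db: any vector store
```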
As you can imagine, the process is complicated, involves a lot of trial and error, and is brittle. What's worse, even if we execute it as well as possible, the parsing may still fall short.
Why parsing often falls short
Tables and graphics are common in PDFs. The image below shows how they are typically handled in LLM applications:
- Documents are parsed and chunked
- Tables are turned into text; the cell contents are copied without keeping the table boundaries
- Images are fed into a multimodal LLM to generate a text summary, or alternatively the original image is fed into a multimodal embedding model without generating a text summary
However, there are two problems with this traditional method.
# 1. Complex tables cannot be interpreted as texts
Take this table as an example. As human readers, we interpret that for a temperature change of >2°C to 2.5°C, the impact on health is: “A rise of 2.3°C by 2080 puts up to 270 million at risk from malaria”.

However, if we convert this table to text, it looks like this: Temperature change Within EC target <(2˚C) >2˚C to 2.5˚C >3C Health Globally it is estimated that A rise of 2.3oC by 2080 puts A rise of 3.3oC by 2080 an average temperature rise up to 270 million at risk from would put up to 330...
The result is a jumbled block of text with no visual structure. Even for a human reader, it is nearly impossible to extract any meaningful insight from it. When this kind of text is fed to an LLM, it likewise fails to produce an accurate interpretation.
# 2. Separation between text and images
Image descriptions are often embedded in the surrounding text, separate from the images themselves. Taking the example below, we know that the chart represents “the modeled costs of climate change on a common scale (for a selection of time periods and discounting approaches)”.

However, because the document is chunked, the description of the image (a text chunk) ends up separated from the image itself (an image chunk). So we can expect that, during RAG, the image will not be retrieved as input when we ask a question like “What is the cost of climate change?”

Therefore, even if we engineer parsing solutions that preserve as much information as possible, they often fail when faced with real-world documents.
Given how critical retrieval is to a RAG application, does this mean RAG is destined to fail when working with complex documents? That's not the case. With ColPali, we have an elegant and efficient way to handle them.
What is ColPali?
The basic premise of ColPali is simple: humans read documents as pages, not “chunks”, so it makes sense to treat a PDF the same way – convert the PDF pages into images, retrieve the relevant pages, and use them as context for the LLM to generate the answer.
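
Concretely, the page-to-image step can be as small as this (a sketch using the pdf2image library, which is also what my example code relies on; the file name is a placeholder):

```python
# Convert each PDF page into a PIL image – one retrieval unit per page.
# Requires the poppler system package to be installed.
from pdf2image import convert_from_path

pages = convert_from_path("report.pdf", dpi=150)  # placeholder file name
print(f"Converted {len(pages)} pages")            # each element is a PIL.Image
```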

Now, the idea of embedding images using multimodal models is not new – it is a common approach. So what makes ColPali stand out? The key lies in its adaptation of ColBERT, a model that embeds the input into multiple vectors, making retrieval more accurate and efficient.
Before diving into ColPali's capabilities, let me briefly explain what ColBERT is.
ColBERT: Granular, context-aware embeddings for documents
ColBERT is a text embedding approach that produces multiple vectors per passage to improve the accuracy of text retrieval.
Let's look at an example. Given the question “Is Paul vegan?”, we need to identify which text chunk contains the relevant information.

Naturally, we would point to Text Chunk A as the relevant one. But if we use a single-vector embedding model (such as text-embedding-ada-002), it returns Text Chunk B instead.
The reason lies in how single-vector bi-encoder models – like text-embedding-ada-002 – work. They try to squeeze the whole passage into one vector, without representing the individual words in a context-aware way. In contrast, ColBERT embeds each token with contextual awareness, resulting in a multi-vector representation that captures fine-grained details.
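
At query time, the two multi-vector representations are compared with a “late interaction” score: for each query token, take its maximum similarity over all document tokens, then sum these maxima. A minimal NumPy sketch (random vectors stand in for real token embeddings):

```python
# Late-interaction (MaxSim) scoring as used by ColBERT (and ColPali).
# Each text is a matrix with one L2-normalized vector per token.
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum over query tokens of the max similarity to any document token.
    query_vecs: (q_tokens, dim); doc_vecs: (d_tokens, dim)."""
    sims = query_vecs @ doc_vecs.T         # pairwise token similarities
    return float(sims.max(axis=1).sum())   # best doc token per query token

rng = np.random.default_rng(0)

def fake_tokens(n: int, dim: int = 128) -> np.ndarray:
    """Stand-in for contextualized token embeddings."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

query = fake_tokens(5)                      # e.g. "Is Paul vegan?"
chunk_a, chunk_b = fake_tokens(40), fake_tokens(60)
scores = {"A": maxsim_score(query, chunk_a), "B": maxsim_score(query, chunk_b)}
print(max(scores, key=scores.get))          # the chunk ranked first
```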

ColPali: ColBERT's sibling for handling documents as images
ColPali follows the same philosophy but operates on document images. Just as ColBERT breaks text into tokens and embeds each one, ColPali divides a page image into patches and produces an embedding for each patch. This approach preserves fine-grained visual and contextual details, enabling accurate retrieval and interpretation.

Besides achieving high retrieval accuracy, ColPali's advantages include:
- Explainability: ColPali enables patch-level comparisons between the query and individual document images. This allows us to better understand and explain why a particular document is considered relevant
- Reduced development effort and greater resilience: By eliminating the need for complex ingestion pipelines – such as chunking, OCR, and structure parsing – ColPali significantly reduces development time and removes potential points of failure
- Performance gains: Faster embedding and retrieval processes, resulting in better system responsiveness
Now that you know what ColPali is, let's dig into the code and see whether it can solve the challenges mentioned above!
The example
The code can be found on my GitHub; a condensed sketch of its retrieval core follows the notes below:
- Hardware: Running the code requires a machine with an A100 GPU
- Embedding model: ColPali has many variants; I use vidore/colqwen2-v0.1. You can refer to the leaderboard here and switch to other models
- Agent:
  – LLM: I use OpenRouter to consume the LLM, and the agent is powered by GPT-4o. You can replace it with any multimodal LLM that can take an image as input
  – Orchestration: LangGraph is used to implement the agent
- PDF-to-image conversion: pdf2image, which is a wrapper for poppler – so please make sure you have poppler installed as well
- Sample data: “The Impacts and Costs of Climate Change”, written by Paul Watkiss et al., available in the public domain here
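
Condensed to its core, the retrieval part looks roughly like this – a sketch following the colpali-engine quick-start, with a placeholder PDF path, and embedding all pages in a single batch for brevity (a real run would batch them to fit GPU memory):

```python
# Sketch: embed PDF pages as images with ColQwen2 and rank them for a query.
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

pages = convert_from_path("climate_report.pdf")  # placeholder path
queries = ["What is the health impact if the temperature rises by 3 degrees?"]

batch_images = processor.process_images(pages).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)    # multi-vector: one per patch
    query_embeddings = model(**batch_queries)   # multi-vector: one per token

# Late-interaction (MaxSim) scores between the query and every page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
best_page = int(scores[0].argmax())  # page image to hand to the multimodal LLM
```

The top-scoring page image is then passed, together with the question, to the multimodal LLM (GPT-4o in my setup) to generate the final answer.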
The results
The code is straightforward, so I'll jump to the results: how well does it deal with the problems we saw in the “Why parsing often falls short” section?
# 1. Complex tables cannot be interpreted as texts
Question: What is the health impact if the temperature rises above the pre-industrial level by 3 degrees?
Answer: If temperatures rise above pre-industrial levels by 3 degrees Celsius, the health impact will be significant. Specifically, it is estimated that by the 2080s, up to 330 million people would be at risk from malaria. This is a huge increase compared to the 1920s, when the number at risk was between 50 and 100 million.
This is correctly referenced from this table in the PDF!

# 2. Separation between text and images
Question: What are the modeled costs of climate change, according to the Green Book declining model?
Answer: According to the Green Book declining model, the modeled cost of climate change is 7.2 Euro/tCO2.
The correct answer should be 7.4 Euro/tCO2, but we can see it is close!

Conclusion
Traditional pipelines struggle with non-textual content. ColPali treats each PDF page as an image, allowing it to process visual layouts, tables, charts, and embedded graphics – rich formats that conventional parsers tend to distort or ignore.
ColPali brings visual language intelligence to RAG, making it far more effective at handling real-world, multimodal business documents.



