How to use visual language models on long documents

Visual language models take images as input, instead of only text like traditional LLMs. This opens up many possibilities, since we can directly process the content of a document, instead of using OCR to extract the text and feeding that text to an LLM.
In this article, I will discuss how you can use visual language models (VLMs) for long-document understanding tasks. This means using VLMs on very long documents (over 100 pages), or on dense documents that contain a lot of information, such as diagrams. I will discuss what to consider when using VLMs, and what kinds of tasks you can solve with them.
Why do we need VLMs?
I have discussed VLMs a lot in my previous articles, and covered why they are so important for understanding the contents of documents. The main reason VLMs are needed is that much of the information in documents requires visual inspection to understand.
The alternative to VLMs is to run OCR and then use an LLM. The problem here is that you are only extracting the text from the document, and not including visual information, such as:
- Where text is positioned relative to other text
- Non-text information (basically anything that isn't text, such as symbols or drawings)
- Where text is placed relative to this non-text information
This information is often very important to really understand a document, and you usually need VLMs, which are fed the image directly and are therefore able to interpret the visual information.
For long documents, using VLMs is a challenge, since you need many tokens to represent visual information. Processing hundreds of pages is therefore a big challenge. However, with the many recent advances in VLM technology, models have gotten better and better at compressing visual information into fewer tokens, making it easier and more convenient to use VLMs on long documents.

OCR using VLMs
One good option to process long documents, and still include visual information, is to use VLMs to do OCR. Traditional OCR engines such as Tesseract only extract the raw text and its bounding boxes from documents. However, VLMs are also trained to perform OCR, and can handle more advanced text extraction, such as:
- Extracting Markdown
- Describing visual details (e.g., if there is a diagram, describing the diagram in text)
- Adding missing information (e.g., if there is a labeled box with a blank field behind it, you can tell the model to extract the label followed by an explicit <empty> marker)
I will discuss each of these in more detail below.
Recently, DeepSeek released a powerful VLM-based OCR model, which has received a lot of attention and traction, making VLMs for OCR very popular.
Markdown
Extracting Markdown is very powerful, because it extracts formatted text. This allows the model to:
- Preserve headings and subheadings
- Represent tables accurately
- Preserve bold text
This allows the model to produce a more representative text, which accurately reflects the contents of the original document. If you then apply LLMs to this text, they will perform better than if you applied them to plain text extracted with traditional OCR.
LLMs perform better on text formatted as Markdown than on plain text extracted using traditional OCR.
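To make this concrete, here is a minimal sketch of VLM-based Markdown OCR using the OpenAI Python client. The model name, file name, and prompt wording are my own assumptions, not a fixed recipe:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Load one document page that has been rendered to a PNG image
with open("page_1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-5",  # any capable VLM; the model name is an assumption
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract all text from this document page as Markdown. "
                "Preserve headings, tables, and bold text."
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{image_b64}",
            }},
        ],
    }],
)
print(response.choices[0].message.content)  # the page as Markdown
```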
Describing visual details
Another thing you can use VLM-based OCR for is describing visual information. For example, if you have a drawing without text in it, traditional OCR will not extract any information, because it is only trained to extract text characters. However, you can use VLMs to describe the visual content of an image.
Imagine you have a document page consisting of an introduction, an image of the Eiffel Tower, and a conclusion:
This is the introduction text of the document
[image of the Eiffel Tower]
This is the conclusion of the document
If you apply a traditional OCR engine such as Tesseract, you will get the following result:
This is the introduction text of the document
This is the conclusion of the document
This is obviously a problem, because you don't include any details about the picture showing the Eiffel Tower. Instead, you should use a VLM, which would output something like:
This is the introduction text of the document
This image depicts the Eiffel tower during the day
This is the conclusion of the document
If you used an LLM on the traditionally extracted text, it would not know that the document contains a picture of the Eiffel Tower. However, if you used an LLM on the text extracted with a VLM, the LLM would be much better at answering questions about the document.
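Getting this behavior usually comes down to the instruction you give the model. A sketch of the kind of prompt I mean (the wording is my own, and you should adapt it to your documents):

```python
# Prompt sketch (my own wording): ask the VLM to describe images inline,
# at the position where they appear, instead of silently skipping them.
prompt = (
    "Extract all text from this page as Markdown. When you encounter an "
    "image, chart, or drawing, insert a short description of it in square "
    "brackets at the position where it appears on the page."
)
```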
Adding missing information
You can also prompt VLMs to make it explicit in the output when information is missing. To understand this concept, look at the image below:
[Image: a form with an Address field filled in with 'Road 1', an empty Date field, and a Company field filled in with 'Google']
If you use traditional OCR on this image, you will get:
Address Road 1
Date
Company Google
However, you can get a more representative result if you use a VLM, which, if instructed, can output:
Address Road 1
Date <empty>
Company Google
This is very useful, because we make it explicit to any downstream model that the date field is empty. If we do not provide this information, it is impossible to know whether the date field is actually empty, whether the date was simply lost, whether OCR could not extract it, or something else.
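Again, this is mostly a prompting question. A sketch of an instruction that produces the output above (the wording is my own):

```python
# Prompt sketch (my own wording): make blank form fields explicit so a
# downstream model can distinguish "empty" from "not extracted".
prompt = (
    "Extract every field on this form as 'label value'. If a field is "
    "blank, output 'label <empty>' so that it is clear the field exists "
    "but was not filled in."
)
```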
However, OCR using VLMs still suffers from some of the same problems as traditional OCR, because the downstream model does not process the visual information directly. You may have heard that "a picture is worth a thousand words", which often holds when processing visual information as text. Yes, you can have a VLM describe a drawing in text, but this text will never be as descriptive as the drawing itself. Therefore, I argue that you are often much better off processing documents directly using VLMs, as I will cover in the following sections.
Open source vs closed source models
There are many types of VLMs available. Follow the VLM Leaderboard on Hugging Face to keep an eye on new top models. According to this leaderboard, you should go with Gemini 2.5 Pro or GPT-5 if you want to use closed-source models through an API. From my experience, these are good options that work well for understanding long documents and handling complex documents.
However, you may want to use open-source models, due to privacy, cost, or to have more control over your application. In this case, SenseNova-v6-5-pro is one of the best according to the leaderboard. I have not tried this model personally, but I have used Qwen3-VL a lot, with which I have had good experience. Qwen has also released a specific cookbook for understanding long documents.
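If you go the open-source route, a minimal local-inference sketch with Hugging Face transformers could look like the following. The exact model id and message format are assumptions; check the model card for the recommended usage:

```python
from transformers import pipeline

# Model id is an assumption; check the Qwen3-VL collection on Hugging Face
# for the exact name and recommended settings.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen3-VL-8B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "page_1.png"},
        {"type": "text", "text": "Summarize this document page."},
    ],
}]
result = pipe(text=messages, max_new_tokens=512)
print(result[0]["generated_text"])
```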
VLMs for long documents
In this section I will talk about applying VLMs to long documents, and the considerations you should make when doing so.
Processing power considerations
When using an open-source model, one of your main considerations is how big a model you can run, and how long inference takes. You will typically need access to a large GPU, at least an A100 in most cases. Fortunately, these are widely available, and relatively cheap (around 1.5-2 USD per hour with most cloud providers now). However, you should also keep an eye on the latency you can accept. Running VLMs requires a lot of processing, and you should consider the following:
- How long does it take to process one request?
- What image resolution do you need?
- How many pages do you need to process?
If you have a live chat, for example, you need fast processing; however, if you are just processing documents in the background, you can allow longer processing times.
Image resolution is also an important consideration. If you need to be able to read the text in documents, you need high-resolution images, usually more than 2048 × 2048, although this of course depends on the text. Detailed drawings with small text on them, for example, will require high resolution. Increasing the resolution greatly increases the processing time, so this is a critical consideration. You should aim for the lowest resolution that still allows you to solve all the tasks you want to solve. The number of pages deserves the same consideration: processing many pages is sometimes necessary to access all the information in the document, but more often than not, the most important information is found early in the document, so you can get away with processing only the first 10 pages.
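Because resolution drives processing time, it helps to cap page images at the lowest size that still keeps the text legible. A small sketch using Pillow, where the 2048-pixel cap mirrors the rule of thumb above:

```python
from PIL import Image

def cap_resolution(path: str, max_side: int = 2048) -> Image.Image:
    """Downscale an image so its longest side is at most max_side pixels."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in place, preserves aspect ratio
    return img

page = cap_resolution("page_1.png")
page.save("page_1_capped.png")
```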
Answer-dependent processing
Something you can do to reduce the required processing power is to start with a simple configuration, and only escalate to a more demanding configuration if you don't get the desired answers.
For example, you can start by looking at the first 10 pages, and see if you can correctly solve the task at hand, such as extracting a specific piece of information from the document. Only if you can't extract that piece of information do you start looking at further pages. You can apply the same concept to the resolution of your images, starting with low-resolution images, and moving to higher resolutions only when needed.
This hierarchical approach reduces the required processing power, because many tasks can be solved by looking only at the first 10 pages, or by using low-resolution images. Only when necessary do we proceed to process more pages, or higher-resolution images.
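A minimal sketch of this escalation strategy. Here, ask_vlm is a hypothetical helper (not a real library function) that sends the given pages at the given resolution to your VLM and returns the answer, or None if the model could not find it:

```python
def ask_vlm(pages: list[str], max_side: int, question: str) -> str | None:
    """Hypothetical helper: send the pages (capped at max_side pixels) to
    your VLM and return the extracted answer, or None if not found."""
    ...

def answer_with_escalation(all_pages: list[str], question: str) -> str | None:
    # Try cheap configurations first; escalate only if the answer is missing.
    configs = [
        (all_pages[:10], 1024),  # first 10 pages, low resolution
        (all_pages[:10], 2048),  # first 10 pages, high resolution
        (all_pages, 2048),       # all pages, high resolution
    ]
    for pages, max_side in configs:
        answer = ask_vlm(pages, max_side, question)
        if answer is not None:
            return answer
    return None
```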
Cost
Cost is an important consideration when using VLMs. I have processed a lot of documents, and I usually see around a 10x increase in the number of input tokens when I use images (VLMs) instead of text (LLMs). Since input tokens are often the main cost driver when working with long documents, using VLMs can quickly become cost prohibitive. Note that for OCR, the point about input tokens outnumbering output tokens does not apply, because OCR naturally generates a lot of output tokens when extracting all the text from the images.
Therefore, when using VLMs, it is very important to maximize your use of cached tokens, a topic I discussed in my recent article about optimizing LLMs for cost and latency.
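A practical way to benefit from caching is to keep the long, repeated part of the prompt (the instructions and the document images) identical across requests, and put the varying question last, since providers typically cache a shared prompt prefix. A sketch, reusing the message format from the OCR example above:

```python
def build_messages(page_image_parts: list[dict], question: str) -> list[dict]:
    """Prompt layout that maximizes prefix-cache hits: the static
    instructions and document images come first and stay identical across
    requests; only the trailing question varies."""
    static_prefix = [
        {"type": "text", "text": "You will answer questions about this document."},
        *page_image_parts,  # same base64 image parts, same order, every request
    ]
    return [{
        "role": "user",
        "content": static_prefix + [{"type": "text", "text": question}],
    }]
```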
Conclusion
In this article I discussed how you can use visual language models (VLMs) on long documents, to handle complex document understanding tasks. I discussed why VLMs are so important, and different approaches to using VLMs on long documents. You can, for example, use VLMs for OCR with richer output than traditional OCR, or apply VLMs directly to long documents, albeit with care regarding the required processing power, cost, and latency. I think VLMs are becoming more and more important, as highlighted by the recent release of DeepSeek OCR, so VLMs for document understanding is a topic we should engage with, and you should learn how to use VLMs for your document applications.
👉 Find me in the community:
📩 Subscribe to my newsletter
🧑💻 Get in touch
🔗 LinkedIn
🐦 X / Twitter
✍️ Medium
You can read my other articles: