
Using vision language models to process millions of documents

Vision language models (VLMs) give machines the power to process both visual and textual information. With the latest Qwen 3 VL release, I want to take a deep dive into using these powerful VLMs to process documents.


Why you need VLMs

To highlight why some tasks require VLMs, I want to start with an example where we need to combine textual and visual information.

Imagine looking at the picture below. The checkboxes represent whether a document must be included in a report or not, and you need to decide which documents to include.

This figure highlights a problem well suited for VLMs. You have an image containing text about the documents, as well as checkboxes, and you need to determine which documents are checked. This is difficult to solve with pure LLMs, because you first need to run OCR on the image, and the extracted text loses its visual position, which is required to solve the task. With a VLM, you can read the text in the document and use its visual position (whether the checkbox above the text is ticked or not) to solve the task. Photo by author.

For a human, this is an easy task; obviously, documents 1 and 3 should be included, while document 2 should be excluded. However, if you try to solve this problem with a pure LLM, you will run into problems.

To feed this to a pure LLM, you would first need to OCR the image. If you use Google Tesseract, for example, the extracted text would look something like this:

Document 1  Document 2  Document 3  X   X 

As you can see, the LLM will have problems deciding which documents to include, because it is impossible to know which documents the Xs belong to. This is one of the many cases where VLMs work very well.

The main point here is that knowing which documents are marked with a checkbox X requires both visual and textual information. You need to know both the text and the visual position of that text in the image. I summarize this in the rule below:

VLMs are required when the meaning of the text depends on its visual position.

Application areas

There is a plethora of areas where you can put VLMs to use. In this section, I will cover some of the different areas where VLMs have proven useful, and where I have applied VLMs successfully.

Agentic use cases

AI agents are all the rage nowadays, and VLMs play a role here as well. I will highlight two main areas where VLMs can be used in an agentic context, though there are naturally other areas too.

Computer use

Computer use is a fascinating VLM use case. By computer use, I refer to pointing the VLM at a screenshot of your computer and having it decide which action to take next. One example of this is OpenAI's Operator. It could, for example, look at the outline of this current article and scroll down to learn more about it.

VLMs are useful for computer use because an LLM alone is not enough to decide which actions to take. When operating a computer, you often have to interpret the position of buttons and text, which, as described in the introduction, is one of the main areas where you need VLMs.

Debugging

Debugging code is another useful area for agentic application of VLMs. Imagine that you are developing a web application and encounter a bug.

One option is to open the console, copy the logs, explain to Cursor what you did, and ask Cursor to fix the bug. This is a tedious approach, as it requires many manual steps from the user.

Another option is to use VLMs to solve the problem more effectively. In an agentic flow, the VLM can click through the application, inspect the logs, reproduce the problem, and then fix the bug. There are applications designed specifically for areas like these, although most of it can be achieved with the agentic developer tools I have seen.

Visual question answering

Using VLMs for visual question answering is one of the classic ways of using VLMs. The task from the beginning of this article, figuring out which checkboxes are ticked, is a visual question answering use case. You feed the VLM a user question along with an image (or several images), which the VLM processes. The VLM then returns a response in text format. You can see how this process works in the figure below.

This figure highlights the visual question answering task, where I use a VLM to solve the problem. We feed the VLM an image of a document, along with a question describing the task to solve. The VLM processes this information and outputs the expected response. Photo by author.
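As a minimal sketch of how such a request is typically assembled, the snippet below builds an OpenAI-style multimodal chat message from an image and a question. The exact payload shape depends on your provider; the model name in the comment is a placeholder assumption.

```python
import base64


def build_vqa_messages(image_bytes: bytes, question: str) -> list:
    """Build an OpenAI-style multimodal chat message: one user turn
    containing the question text and the base64-encoded image."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ]


# The messages list is then passed to any OpenAI-compatible client, e.g.:
# client.chat.completions.create(model="qwen3-vl", messages=messages)
```

The same structure works for multi-image questions by appending additional `image_url` entries to the content list.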

You should, however, be aware of the trade-offs of using VLMs vs LLMs. As discussed, when the task requires both textual and visual information, you need a VLM for reliable results. However, VLMs are usually also more costly to run, because they need to process more tokens. This is because images contain a lot of detail, leading to many tokens for the model to process.

In addition, if the VLM is processing text, you also need high-resolution images to allow the VLM to interpret the pixels making up the letters. At low resolutions, the VLM will struggle to read the text in the images, and you will get low-quality results.

Classification

This figure covers how to use VLMs for classification tasks. We feed the VLM an image of a document, along with a prompt asking it to classify the document into one of a set of predefined categories. These categories should be included in the prompt, but are left out of the figure due to space limitations. The VLM then outputs the predicted classification label. Photo by author.

Another exciting application area for VLMs is classification. By classification, I refer to having a set of predefined categories and needing to decide which category an image belongs to.

You can use VLMs for classification the same way you use LLMs. You create a prompt that contains all the relevant information, including the output categories. In addition, you may want to handle different edge cases, for example, cases where two categories are plausible and the VLM should decide between the two.

You can, for example, have a prompt such as:

def get_prompt():
    return """
        ## General instructions
        You need to determine which category a given document belongs to. 
        The available categories are "legal", "technical", "financial".

        ## Edge case handling
        - In the scenario where you have a legal document covering financial information, the document belongs to the financial category
        - ...
        ## Return format
        Respond only with the corresponding category, and no other text 
    """

You can also use VLMs for information extraction, and there are many information extraction tasks that require visual information. You create a prompt similar to the classification prompt above, and it is common to ask the VLM to respond in a structured format, such as a JSON object.

When performing information extraction, you need to consider how many data points you want to extract per request. For example, if you need to extract 20 different data points from a document, you probably do not want to extract them all at the same time. This is because the model will struggle to extract that much information in a single request.

Instead, you should consider splitting the task, for example, extracting 10 data points in each of two separate requests, to make the model's job easier. On the flip side, you will sometimes find that some data points are related, meaning they should be extracted in the same request. In addition, sending several requests increases processing costs.
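The splitting described above can be sketched as a small helper that groups the field names into batches, where each batch becomes one extraction request. This is an illustrative sketch under my own assumptions; it does not handle the related-fields caveat, which would require grouping those fields together manually.

```python
def batch_fields(fields: list[str], batch_size: int = 10) -> list[list[str]]:
    """Split the data points to extract into batches, so each VLM
    request only asks for a manageable number of fields."""
    return [fields[i:i + batch_size] for i in range(0, len(fields), batch_size)]
```

For 20 data points with the default batch size, this yields two requests of 10 fields each; each batch then gets its own prompt and API call, trading extraction quality against the extra request cost.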

This figure highlights how to use VLMs for information extraction. Again, we feed the VLM an image of a document, along with a prompt asking the VLM to extract certain data points. In this instance, I prompt the VLM to extract the date of the document, the location mentioned in the document, and the type of document. The VLM analyzes the prompt and the document image, and outputs a JSON object containing the requested information. Photo by author.

VLM limitations

Vision language models are impressive models that can perform tasks thought impossible for AI just a few years ago. However, they too have their limitations, which I will cover in this section.

Cost of running VLMs

The first limitation is the cost of running VLMs, which I briefly discussed earlier in this article. VLMs process images, which contain many pixels. These pixels represent a lot of detail, which is converted into tokens that the VLM can process. The issue is that because images contain so much detail, you need many tokens per image, which increases the cost of running VLMs.

In addition, you usually need high-resolution images, since the VLM is required to read the text in the images, resulting in even more tokens to process. VLMs can thus be expensive to run, whether over an API or in compute costs if you decide to host the VLM yourself.
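To get a feel for this cost, you can roughly estimate the visual token count from the image resolution. The sketch below assumes one token per 28x28-pixel patch, which is in the ballpark of the Qwen-VL family; check your model's documentation for the exact figure, as this varies between models.

```python
def estimate_image_tokens(width: int, height: int, patch_px: int = 28) -> int:
    """Rough visual-token estimate: one token per patch_px x patch_px
    patch, rounding each dimension up to a whole number of patches."""
    patches_w = -(-width // patch_px)   # ceiling division
    patches_h = -(-height // patch_px)
    return patches_w * patches_h
```

Under this assumption, a 1092x1092 scan costs 39 x 39 = 1521 visual tokens, and doubling the resolution in both dimensions roughly quadruples the token count, which is exactly why high-resolution document pages get expensive.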

Difficulty processing long documents

The number of tokens images require also limits the number of pages a VLM can process at the same time. VLMs are constrained by their context length, just like traditional LLMs. This is a problem if you want to process long documents containing hundreds of pages. Naturally, you can split the document into chunks, but you can run into problems when the VLM does not have access to all the content of the document in one go.

For example, if you have a 100-page document, you can first process pages 1-50, and then process pages 51-100. However, if some detail on page 53 requires context from page 1 (for example, the title or document date), this will cause issues.
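One simple mitigation, sketched below under my own assumptions (this is not the Qwen cookbook's approach), is to prepend the first page to every later chunk, so that global context such as the title or document date travels with each request:

```python
def chunk_pages(pages: list, chunk_size: int = 50) -> list:
    """Split a document into chunks of at most chunk_size pages,
    prepending page 1 to every chunk after the first so that global
    context (title, document date) is always available to the VLM."""
    chunks = []
    for i in range(0, len(pages), chunk_size):
        chunk = pages[i:i + chunk_size]
        if i > 0:
            chunk = [pages[0]] + chunk
        chunks.append(chunk)
    return chunks
```

This costs one extra page of tokens per chunk, which is usually a cheap price for not losing document-level context, though it only helps when the shared context actually lives on the first page.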

To learn how to deal with this problem, I recommend reading the Qwen 3 VL cookbooks, which include a page on how they process ultra-long documents. I will be sure to test this out and discuss how it works in a future article.

Conclusion

In this article, I discussed vision language models and how you can use them in different problem areas. I first explained how to use VLMs in agentic applications, for example for computer use or for debugging web applications. Furthermore, I covered areas such as visual question answering, classification, and information extraction. Finally, I covered the limitations of VLMs, discussing the cost of running VLMs and how they struggle with long documents.

👉 You can also find me in the community:

🧑‍💻 Get in touch

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium
