
How to use Frontier Vision LLMs: Qwen3-VL

Visual language models (VLMs) are powerful models that can take both images and text as input and respond with text. This allows us to extract visual details from documents and images. In this article, I will discuss the newly released Qwen3-VL and the powerful VLM capabilities it has.

Qwen3-VL was released a few weeks ago, initially with the 235B-A22B model, which is quite a large model. They have since released the 30B-A3B, and now also the compact 4B and 8B versions. My purpose with this article is to highlight the capabilities of visual language models and inform you about them at a higher level. I will use Qwen3-VL as a specific example throughout, although there are many other high-end VLMs available. I am not affiliated with Qwen in any way; I wrote this article independently.

This image covers the main topics of this article. I will discuss visual language models and how they are in many cases better than using OCR + LLMs for understanding documents. In addition, I will cover using VLMs for OCR and information extraction with Qwen3-VL, and finally I will discuss some VLM downsides. Image by ChatGPT.

Why we need visual language models

Visual language models are needed because the alternative is to rely on OCR and feed the OCR'd text into an LLM. This approach has several problems:

  • OCR is not perfect, and the LLM has to deal with incomplete or error-prone text
  • You lose the information contained in the visual layout of the text

Traditional OCR engines such as Tesseract have long been essential for document processing. OCR has allowed us to feed in images and extract the text from them, enabling us to process a document's contents. However, traditional OCR is far from perfect, and it can struggle with issues such as small text, grainy images, skewed text, and so on. If you have a bad OCR result, you will struggle with all downstream tasks, whether you are using regex or an LLM. With VLMs you feed images directly to the model, instead of feeding OCR'd text to an LLM, so you make more efficient use of the information.

The visual position of text is sometimes important for understanding its meaning. Consider the example in the image below, where checkboxes indicate which text is relevant: some checkboxes are checked, and others are not. You then have some text corresponding to each checkbox, where only the text next to a checked checkbox is relevant. Extracting this information using OCR + LLMs is challenging, because after OCR you don't know which checkbox each piece of text belongs to. Solving this task using visual language models, however, is trivial.

This example highlights a situation where visual language models are needed. If you simply OCR the text, you lose the visual position of the checked checkboxes, and it is therefore a challenge to know which of the three documents are relevant. Solving this task with a visual language model, however, is easy. Image by the author.

I fed the image above to Qwen3-VL, and it responded with the output shown below:

Based on the image provided, the documents that are checked off are:

- **Document 1** (marked with an "X")
- **Document 3** (marked with an "X")

**Document 2** is not checked (it is blank).

As you can see, Qwen3-VL easily solved this problem correctly.


Another reason we need VLMs is that they also give us video understanding. Truly understanding video clips would be a major challenge using OCR, as most of the information in videos is not displayed as text, but directly as images, so OCR falls short. However, the new generation of VLMs allows you to feed in hundreds of images, for example frames sampled from a video, allowing you to perform video understanding tasks.

Visual language model use cases

There are many tasks you can use visual language models for. I will discuss a few relevant ones:

  • OCR
  • Information extraction

Test data

I will use the image below as an example for my test.

I will use this image for my Qwen3-VL test. It is a publicly available document from the planning authority of Oslo Municipality in Norway ("Plan- og bygningsetaten"). I'm using this image because it is an example of a real document that you would want to apply a visual language model to. Note that I have cropped the image, as the original also contained the map drawing itself. Unfortunately, my local computer is not powerful enough to process such a large image, so I decided to crop it. This allows me to feed the image to Qwen3-VL in high resolution. The image has a resolution of (768, 136), which is sufficient in this case to perform OCR. It was saved as a JPG, extracted from a PDF at 600 DPI.

I will use this image because it is an example of a real document, well suited for applying Qwen3-VL to. In addition, I cropped the image to its current state so that I can feed it in high resolution to Qwen3-VL on my local computer. Maintaining high resolution is important if you want to perform OCR on an image. I extracted the JPG from a PDF at 600 DPI. Normally, 300 DPI is enough for OCR, but I kept the DPI higher just to be sure, which works fine for this small image.
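
For reference, below is a minimal sketch of how such a high-DPI JPG could be produced from a PDF. It uses the pdf2image package (which requires Poppler); the package choice, the input file name, and the crop box are my own assumptions, not necessarily what was used for the image above.

from pdf2image import convert_from_path  # assumes pdf2image and Poppler are installed

# Render the first PDF page to a PIL image at 600 DPI (input file name is a placeholder).
pages = convert_from_path("example-doc-site-plan.pdf", dpi=600)
page = pages[0]

# Optionally crop to the region of interest before saving; the crop box
# (left, upper, right, lower) below is purely illustrative.
cropped = page.crop((0, 0, page.width // 2, page.height // 4))
cropped.save("example-doc-site-plan-cropped.jpg", "JPEG")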

Preparing Qwen3-VL

I need to install the following to run Qwen3-VL:

torch
accelerate
pillow
torchvision
git+https://github.com/huggingface/transformers

You need to install transformers from source (GitHub), as Qwen3-VL is not yet available in the latest released version of transformers.

The following code loads the model and processor, and creates the inference function:

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import os
import time

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")



def _resize_image_if_needed(image_path: str, max_size: int = 1024) -> str:
    """Resize image if needed to a maximum size of max_size. Keep the aspect ratio."""
    img = Image.open(image_path)
    width, height = img.size
    
    if width <= max_size and height <= max_size:
        return image_path
    
    ratio = min(max_size / width, max_size / height)
    new_width = int(width * ratio)
    new_height = int(height * ratio)
    
    img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
    
    base_name = os.path.splitext(image_path)[0]
    ext = os.path.splitext(image_path)[1]
    resized_path = f"{base_name}_resized{ext}"
    
    img_resized.save(resized_path)
    return resized_path


def _build_messages(system_prompt: str, user_prompt: str, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
    ]
    
    user_content = []
    if image_paths:
        if max_image_size is not None:
            processed_paths = [_resize_image_if_needed(path, max_image_size) for path in image_paths]
        else:
            processed_paths = image_paths
        user_content.extend([
            {"type": "image", "min_pixels": 512*32*32, "max_pixels": 2048*32*32, "image": image_path}
            for image_path in processed_paths
        ])
    user_content.append({"type": "text", "text": user_prompt})
    
    messages.append({
        "role": "user",
        "content": user_content,
    })
    
    return messages


def inference(system_prompt: str, user_prompt: str, max_new_tokens: int = 1024, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = _build_messages(system_prompt, user_prompt, image_paths, max_image_size)
    
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    
    start_time = time.time()
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")
    
    return output_text[0]

OCR

OCR is a task that VLMs are heavily trained on. For example, you can read the technical reports of the Qwen VL models, where they mention how OCR data is part of the training set. To train a VLM to do OCR, you give the model a series of images and the text contained in those images. The model then learns to extract text from the images.
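
Conceptually (this is my own toy illustration, not Qwen's actual training data format), an OCR training example is just an image paired with its ground-truth transcription:

# Toy sketch of an OCR training pair: an image and the text it contains.
# The structure and file name are illustrative only.
ocr_training_example = {
    "image": "scanned-site-plan-001.jpg",
    "target_text": "Plan- og bygningsetaten\nDato: 23.01.2014\nMålestokk 1:500",
}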

I will apply OCR to the image using the prompt below, which is the same prompt the Qwen team uses to perform OCR according to the Qwen3-VL cookbook.

user_prompt = "Read all the text in the image."

Now I will run the model on the test image, which I have named example-doc-site-plan-cropped.jpg:

system_prompt = """
You are a helpful assistant that can answer questions and help with tasks.
"""

user_prompt = "Read all the text in the image."
max_new_tokens = 1024

image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)

Which outputs:

Plan- og
bygningsetaten

Dato: 23.01.2014
Bruker: HKN
Målestokk 1:500
Ekvidistanse 1m
Høydegrunnlag: Oslo lokal
Koordinatsystem: EUREF89 - UTM sone 32
© Plan- og bygningsetaten,
Oslo kommune
Originalformat A3

Adresse:
Camilla Colletts vei 15

Gnr/Bnr:
.

Kartet er sammenstilt for:
.

PlotID: / Best.nr.:
27661 /

Deres ref: Camilla Colletts vei 15

Kommentar:
Gjeldende kommunedelplaner:
KDP-BB, KDP-13, KDP-5

Kartutsnittet gjelder vertikalinvå 2.
I tillegg finnes det regulering i
følgende vertikalinvå:
(Hvis blank: Ingen øvrige.)

Det er ikke registrert
naturn mangfold innenfor
Se tegnforklaring på eget ark.

Beskrivelse:
NR:
Dato:
Revidert dato:

In my inspection, this extraction is completely correct: it covers all the text in the image and gets all the relevant characters right.

Information extraction

You can also perform information extraction using visual language models. This can, for example, be used to extract valuable metadata from images. You usually want to output this metadata in JSON format, so it is easily parseable and can be used for downstream tasks. In this example, I will extract:

  • Date – 23.01.2014 in this example
  • Address – Camilla Colletts vei 15 in this example
  • Gnr (street number) – which is an empty field in the test image
  • Målestokk (scale) – 1:500

I use the following code:

user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
    "date": "The date of the document. In format YYYY-MM-DD.",
    "address": "The address mentioned in the document.",
    "gnr": "The street number (Gnr) mentioned in the document.",
    "scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024

image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)

Which outputs:

{
    "date": "2014-01-23",
    "address": "Camilla Colletts vei 15",
    "gnr": "15",
    "scale": "1:500"
}

The JSON object is in a valid format, and Qwen successfully extracted the date, address, and scale fields. However, Qwen also returned a Gnr. At first, when I saw this result, I thought it was a hallucination, because the Gnr field in the test image is empty. However, Qwen makes the natural assumption that the Gnr is the number in the address, which is correct in this case.

To verify its ability to answer with None when it can't find something, I asked Qwen to extract the Bnr (building number), which is not present in this example. Running the code below:

user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
    "date": "The date of the document. In format YYYY-MM-DD.",
    "address": "The address mentioned in the document.",
    "Bnr": "The building number (Bnr) mentioned in the document.",
    "scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024

image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)

I get:

{
    "date": "2014-01-23",
    "address": "Camilla Colletts vei 15",
    "Bnr": None,
    "scale": "1:500"
}

So as you can see, Qwen is able to tell us when information is missing from the document.
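
Since the point of JSON output is that downstream code can consume it, here is a minimal sketch of how I would parse the reply into a Python dict. Note that in the run above the model wrote None rather than JSON null, so the sketch normalizes that first; the parse_extraction helper is my own, not part of Qwen3-VL or transformers.

import json

def parse_extraction(reply: str) -> dict:
    cleaned = reply.strip()
    # Strip markdown code fences if the model wrapped the JSON in them.
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    # Normalize Python-style None (as seen in the output above) to JSON null.
    cleaned = cleaned.replace(": None", ": null")
    return json.loads(cleaned)

fields = parse_extraction(output)
print(fields["address"], fields["Bnr"])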

Downsides of visual language models

I would also like to note that there are some downsides to visual language models. The image I tested OCR and information extraction on is a simple image. To truly test the capabilities of Qwen3-VL, I would have to give it more challenging tasks, for example extracting text from a longer document or extracting more metadata fields.

The main downsides of VLMs, from what I've seen, are:

  • They sometimes miss text when doing OCR
  • Inference is slow

VLMs missing text when doing OCR is something I've seen a few times. When it happens, the VLM usually skips a section of the document and ignores its text entirely. This is naturally a big problem, because you can miss very important text in downstream tasks like keyword search. Why this happens is a complex topic that's outside the scope of this article, but it's an issue you should be aware of if you're doing OCR with VLMs.

In addition, VLMs require a lot of processing power. I'm running locally on my PC, even though I'm working with a very small model. I started facing memory problems when I wanted to process an image of size 2048 × 2048, which is a problem if I want to avoid downscaling large documents. With that in mind, you can imagine how resource-intensive it gets to run VLMs with:

  • Multiple images at the same time (for example, processing a 10-page document, as in the sketch after this list)
  • Documents at higher resolutions
  • A larger VLM, with more parameters
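
To illustrate the first bullet: the inference function defined earlier already accepts a list of image paths, so a multi-page document is simply a longer list. The page file names below are placeholders, and on my hardware a request like this would quickly run into memory limits.

# Sketch: feeding a 10-page document as ten images in a single request.
# The file names are placeholders; resolution is capped to limit memory use.
page_paths = [f"document-page-{i}.jpg" for i in range(1, 11)]
output = inference(
    system_prompt,
    "Summarize the key information across all pages of this document.",
    max_new_tokens=1024,
    image_paths=page_paths,
    max_image_size=1024,
)
print(output)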

Conclusion

In this article, I discussed VLMs. I started by discussing why we need VLMs, highlighting how some tasks require both the text and its visual position. In addition, I highlighted some tasks you can perform with VLMs and showed that Qwen3-VL was able to handle these tasks. I think visual language models will become more and more important in the coming years. Until last year, almost all the focus was on pure text models. However, to get more powerful models, we need to utilize the visual modality as well, which is where I believe VLMs are most important.
