LLaVA on a Budget: Multimodal AI with Limited Resources

A few years ago, I worked mostly on large language models: training, fine-tuning, prompting, and so on, because that was what the market and users were asking for. But I believe that LLMs working mainly with text are just the beginning of GenAI. At some point, everyone will want multimodal AI: models that can see, hear, feel, and reason about the world in a more human-like way.
So let's get started with multimodality. In this notebook, I implement LLaVA, an architecture capable of interpreting both images and text to produce multimodal answers.
In this tutorial, we will use a lightweight version of the original architecture, so that the whole thing can be trained in a free-tier environment such as Google Colab.
The components we will use are:
1️⃣ CLIP-ViT B/32 as the image encoder
2️⃣ TinyLlama-1.1B as the language model
3️⃣ A 2-layer MLP adapter to bridge the two
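To build some intuition for piece 3️⃣, here is a minimal sketch of what such a 2-layer MLP adapter looks like: it projects the vision encoder's patch embeddings into the language model's hidden space. We will not write this module by hand, since LlavaForConditionalGeneration creates an equivalent projector internally; the hidden sizes below (768 for CLIP-ViT B/32, 2048 for TinyLlama) are shown only for illustration.
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Maps CLIP patch embeddings into the LLM embedding space."""

    def __init__(self, vision_hidden_size=768, text_hidden_size=2048):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features):
        # image_features: (batch_size, num_patches, vision_hidden_size)
        return self.linear_2(self.act(self.linear_1(image_features)))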
Setup
Before diving into the code, let's set up our environment.
Let's start by installing the datasets library.
!pip install -U datasets
Now we need to import the required packages from Hugging Face and PyTorch. These libraries provide the pretrained models and the multimodal processing utilities we need.
import json
from pathlib import Path
import requests
import safetensors
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import (
AutoConfig,
AutoTokenizer,
LlamaTokenizer,
LlavaConfig,
LlavaForConditionalGeneration,
LlavaProcessor,
Seq2SeqTrainer,
Seq2SeqTrainingArguments,
)
from transformers.models.clip.modeling_clip import CLIPVisionModel
from transformers.models.clip.image_processing_clip import CLIPImageProcessor
Download the pretrained model components
Our LLaVA model will be built from two pretrained components: the CLIP-ViT B/32 vision encoder and the TinyLlama-1.1B language model.

The hf_hub_download utility checks the Hugging Face Hub and retrieves the pretrained weights:
vision_backbone_name = "openai/clip-vit-base-patch32"
text_backbone_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
_ = hf_hub_download(
vision_backbone_name, filename="pytorch_model.bin", local_dir="/content"
)
_ = hf_hub_download(
text_backbone_name, filename="model.safetensors", local_dir="/content"
)
Model
Let's now instantiate a new LLaVA model. As described above, a LLaVA model is made of two parts: a vision encoder and a text decoder.
vision_config = AutoConfig.from_pretrained(vision_backbone_name).vision_config
text_config = AutoConfig.from_pretrained(text_backbone_name)
We pass the two backbone configurations to a LlavaConfig. Then we instantiate the actual model with LlavaForConditionalGeneration(llava_config).
llava_config = LlavaConfig(vision_config=vision_config, text_config=text_config)
model = LlavaForConditionalGeneration(llava_config).cuda()
model
Perform some surgical operations

Earlier, we said that we would build our LLaVA model starting from a pretrained image encoder and a pretrained LLM. Let's do just that!
The original LLaVA model starts from CLIP-ViT L/14 and Vicuna v1.5 7B. To keep things manageable with the resources offered by the free Google Colab plan, we will use CLIP-ViT B/32 and TinyLlama 1.1B instead.
The only part we will train is the MLP adapter that connects them.
To use the CLIP and TinyLlama models, we need to load their pretrained weights. These weights can come in different formats, such as .safetensors or .bin. The load_weights function handles this for us: it checks the file extension and dispatches to the appropriate loading function.
def load_weights(path_to_weights: str):
    if path_to_weights.endswith(".safetensors"):
        return load_safetensors_weights(path_to_weights)
    elif path_to_weights.endswith(".bin"):
        return load_bin_weights(path_to_weights)
    else:
        raise ValueError(f"Unsupported weights file: {path_to_weights}")

def load_bin_weights(path_to_weights: str):
    return torch.load(path_to_weights, weights_only=True)

def load_safetensors_weights(path_to_weights: str):
    return safetensors.torch.load_file(path_to_weights)
vision_backbone_state_dict = load_weights("/content/pytorch_model.bin")
text_backbone_state_dict = load_weights("/content/model.safetensors")
Inject the vision backbone weights into the model 💉
The following lines load the vision weights into the model. We set strict=False because it allows us to skip any weights that do not match the expected model structure.
incompatible_keys = model.vision_tower.load_state_dict(
vision_backbone_state_dict, strict=False
)
assert len(incompatible_keys.missing_keys) == 0, (
f"Missing keys in state dict: {incompatible_keys.missing_keys}"
)
incompatible_keys.unexpected_keys
Inject the text backbone weights into the model 💉
Same idea as before, but this time for the language model.
incompatible_keys = model.language_model.load_state_dict(
text_backbone_state_dict, strict=True
)
Freeze the pretrained components ❄️
We now want to freeze the vision and language models, because we do not want to update their weights during training.
We will only train the multimodal adapter (the MLP connecting vision and language), which is small and therefore cheap and quick to train.
_ = model.vision_tower.requires_grad_(False)
_ = model.language_model.requires_grad_(False)
# Then we define a helper function to count model parameters
def count_parameters(model, trainable_only=False):
    return sum(
        p.numel()
        for p in model.parameters()
        if not trainable_only or p.requires_grad
    )
print(f"Total parameters: {count_parameters(model)}")
print(f"Trainable parameters: {count_parameters(model, trainable_only=True)}")
Processor
Before feeding any text to our model, we need to convert words into numbers. That is the tokenizer's job.
tokenizer = LlamaTokenizer.from_pretrained(
text_backbone_name, additional_special_tokens=["<image>", "<pad>"]
)
tokenizer.pad_token_id = 32001
Below is the format we will use to chat with our LLaVA model.
The first part is the so-called system prompt, which contains general guidelines for how the model should respond to the user.
The second part is a Jinja chat template (basically code) that determines how the conversation is rendered from the structured input (see the example below).
LLAVA_CHAT_TEMPLATE = (
"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. "
"{% for message in messages %}{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}{% elif item['type'] == 'image' %}{% endif %}{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}"
)
tokenizer.chat_template = LLAVA_CHAT_TEMPLATE
sample_messages = [
{
"content": [
{
"index": 0,
"text": None,
"type": "image"
},
{
"index": None,
"text": "nWhat potential activities might be popular at this location?",
"type": "text"
}
],
"role": "user"
},
{
"content": [
{
"index": None,
"text": (
"At this location, with a sandy path leading to the ocean where multiple boats, including "
"sailboats, are moored, popular activities might include boating, sailing, swimming, and "
"beachcombing. Additionally, the sandy path and shoreline provide an ideal setting for leisurely "
"strolls and picnics, while the ocean view offers a serene environment for relaxation and "
"photography. Depending on the specific area and available facilities, other water sports such as "
"kayaking, paddleboarding, and snorkeling could also be prevalent."
),
"type": "text"
}
],
"role": "assistant"
}
]
Let's apply the chat template to our sample messages.
tokenizer.apply_chat_template(
sample_messages, tokenize=False, add_generation_prompt=False
)
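For reference, the rendered string should look roughly like this (truncated here, and assuming the default Llama end-of-sequence token </s>):
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat potential activities might be popular at this location? ASSISTANT: At this location, with a sandy path leading to the ocean [...] </s>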
At this point we have set up our tokenizer and the CLIP image processor. We now combine them into a single unified processor.
processor = LlavaProcessor(
image_processor=CLIPImageProcessor.from_pretrained(vision_backbone_name),
tokenizer=tokenizer,
patch_size=model.config.vision_config.patch_size,
)
processor.chat_template = LLAVA_CHAT_TEMPLATE
Since we added special tokens such as <image> and <pad> to the tokenizer earlier, the model needs to resize its token embeddings to account for them.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
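As a quick sanity check, we can look up the ids of the new special tokens; this is a small sketch, assuming they were appended right after the base 32,000-token Llama vocabulary, which is what the pad_token_id = 32001 assignment above relies on.
print(tokenizer.convert_tokens_to_ids("<image>"))  # expected: 32000
print(tokenizer.convert_tokens_to_ids("<pad>"))  # expected: 32001
print(model.get_input_embeddings().weight.shape[0])  # embedding rows, padded to a multiple of 8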
Dataset
Let's download the dataset we will use from the Hugging Face Hub.
The dataset, containing image-text pairs, is public and can be found here (HuggingFaceH4/llava-instruct-mix-vsft).
train_dataset = load_dataset(
"HuggingFaceH4/llava-instruct-mix-vsft", split="train", streaming=True
)
What do our training examples look like?
next(iter(train_dataset))
How do we build a batch of examples?
The following function takes raw image-text examples and turns them into model-ready inputs. It formats the messages with the chat template, processes texts and images with the LlavaProcessor we defined earlier, and builds the training labels, masking out padding tokens with the ignore index.
def get_data_collator(processor, ignore_index):
    def collate_examples(examples):
        # Extract texts and images from the raw examples
        texts = []
        images = []
        for example in examples:
            messages = example["messages"]
            text = processor.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            texts.append(text)
            images.append(example["images"][0])

        # Process the inputs (tokenize text and transform images)
        batch = processor(texts, images, return_tensors="pt", padding=True)

        # Create labels
        labels = batch["input_ids"].clone()
        if processor.tokenizer.pad_token_id is not None:
            labels[labels == processor.tokenizer.pad_token_id] = ignore_index
        batch["labels"] = labels

        return batch

    return collate_examples

# NOTE: this does a bit more than a collate function should...
Training
Let's define the training arguments, including batch size, learning rate, max steps, and mixed precision (FP16) for speed. We also skip saving checkpoints to keep things simple. Then we wrap everything in a Seq2SeqTrainer, passing it the model, the dataset, and our custom collator for image-text inputs.
args = Seq2SeqTrainingArguments(
output_dir="/content/training_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
max_steps=350,
lr_scheduler_type="cosine_with_min_lr",
lr_scheduler_kwargs={"min_lr": 2e-5},
warmup_ratio=0.05,
logging_strategy="steps",
logging_steps=5,
fp16=True,
remove_unused_columns=False, # Important!
optim="adamw_torch",
report_to="none",
save_strategy="no", # let's not save the checkpoint to disk, otherwise it'll take 5 mins
)
trainer = Seq2SeqTrainer(
model=model,
args=args,
data_collator=get_data_collator(
processor, ignore_index=model.config.ignore_index,
),
train_dataset=train_dataset,
)
trainer.train()
Inference
Keep in mind that for inference to work well, you should use larger models and train them for longer.
We will use the following conversation as the prompt:

conversation = [
{
"content": [
{
"type": "image"
},
{
"text": "nWhat is represented in the image?",
"type": "text"
}
],
"role": "user"
}
]
In the cell block below, as an example, we load an image from a URL and format the conversation using the chat template. The processor turns both into tensors. Then we move the inputs to the model's device and generate a response, letting the model describe the picture based on the user's prompt.
image_url = "
inputs_for_generation = processor(
images=Image.open(requests.get(image_url, stream=True).raw),
text=processor.apply_chat_template(conversation, add_generation_prompt=True),
return_tensors="pt",
)
inputs_for_generation = inputs_for_generation.to(device=model.device)
output = trainer.model.generate(
**inputs_for_generation, max_new_tokens=200, do_sample=False
)
print(processor.decode(output[0], skip_special_tokens=True))
Extensions and improvements
- Use a larger image encoder (e.g. CLIP-ViT Large) and a larger LLM (e.g. Llama 3.1 8B)
- Train for longer. It takes time for the model to figure out how to follow instructions in the presence of image tokens
- Follow the multi-stage training process adopted by the original LLaVA (a rough sketch follows after this list):
  - Stage 1: Pre-training for feature alignment -> train the model on single-turn instruction data, where it is asked to briefly describe the picture. The image encoder and the LLM are frozen
  - Stage 2: End-to-end fine-tuning -> train the model on multi-turn instruction data. Only the image encoder is frozen
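In practice, the difference between the two stages comes down to what is frozen. Below is a minimal sketch, assuming the Hugging Face implementation used above, where the adapter is exposed as model.multi_modal_projector; the calls mirror the requires_grad_ lines from earlier.
# Stage 1: feature alignment, train only the multimodal adapter
_ = model.vision_tower.requires_grad_(False)
_ = model.language_model.requires_grad_(False)
_ = model.multi_modal_projector.requires_grad_(True)

# Stage 2: end-to-end fine-tuning, unfreeze the LLM and keep the image encoder frozen
_ = model.vision_tower.requires_grad_(False)
_ = model.language_model.requires_grad_(True)
_ = model.multi_modal_projector.requires_grad_(True)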
Live demo: huggingface.co/spaces/badayvedat/llava
Conclusion
I think this small project is useful for getting a better understanding of how multimodal models like LLaVA work. Even though we used very small models, the core idea is the same: combine vision and language into a single system that can understand images and talk about them.
Of course, the results obtained in this toy setup are far from great; there is a lot of room for improvement. But getting LLaVA to work at all in such a resource-constrained environment is a rewarding challenge in itself.
Follow me on TDS if you like this article! 😁
💼 LinkedIn | 🐦 X (Twitter) | 💻 Website



