Fine-Tuning Multimodal Embedding Models | by Shaw Talebi | Jan, 2025

The first (and most important) step of any fine-tuning project is data collection. Here, I extracted title–thumbnail pairs from my channel in a two-step process.

First, I used YouTube's Search API to extract the video IDs for all the videos on my channel. Second, I used YouTube's Video API to extract the title and thumbnail URL of each of my long-form videos (i.e. longer than 3 minutes).

# imports
from top_secret import my_key
import requests
from isodate import parse_duration

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from datasets import DatasetDict, Dataset

channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA' # my YouTube channel ID
page_token = None # initialize page token
url = 'https://www.googleapis.com/youtube/v3/search' # YouTube search API

# extract video data across multiple search result pages
video_id_list = []

while page_token != 0:
    params = {
        "key": my_key,
        'channelId': channel_id,
        'part': ["snippet","id"],
        'order': "date",
        'maxResults': 50,
        'pageToken': page_token
    }
    response = requests.get(url, params=params)

    for raw_item in dict(response.json())['items']:

        # only execute for youtube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue

        # grab video ids
        video_id_list.append(raw_item['id']['videoId'])

    try:
        # grab next page token
        page_token = dict(response.json())['nextPageToken']
    except:
        # if no next page token kill while loop
        page_token = 0

Note that you will need a YouTube API key to run the above Python code, which you can create via the Google Cloud Console. To adapt this to your channel, you only need to change the channel_id variable.
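For reference, the top_secret import at the top of the script is presumably just a small local file holding the API key. A minimal sketch, with a placeholder for your actual key:

# top_secret.py (hypothetical helper file; keep it out of version control)
my_key = "YOUR_YOUTUBE_API_KEY"  # paste the key created in the Google Cloud Console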

# extract video titles and thumbnails
url = "https://www.googleapis.com/youtube/v3/videos"
video_data_list = []

for video_id in video_id_list:

    params = {
        "part": ["snippet","contentDetails"],
        "id": video_id,
        "key": my_key,
    }
    response = requests.get(url, params=params)

    raw_dict = dict(response.json())['items'][0]

    # only process videos longer than 3 minutes
    iso_duration = raw_dict['contentDetails']["duration"]
    if parse_duration(iso_duration).total_seconds() < 180:
        continue

    # extract video data
    video_data = {}
    video_data['video_id'] = video_id
    video_data['title'] = raw_dict['snippet']['title']
    video_data['thumbnail_url'] = raw_dict['snippet']['thumbnails']['high']['url']

    # append data to list
    video_data_list.append(video_data)

As an additional step, I created negative title pairs. We can use these during the training process to not only guide the model with examples of which embeddings should be close together (i.e. positive pairs), but also which should be far apart (i.e. negative pairs).

To do this, I computed the similarity between all possible pairs of titles using the sentence-transformers library. Then, for each positive pair, I matched the least similar title as a negative example (while ensuring there were no duplicates).

# store data in dataframe
df = pd.DataFrame(video_data_list)

# Load the model
model = SentenceTransformer("all-mpnet-base-v2")

# Encode all titles
embeddings = model.encode(df['title'].to_list())

# compute similarities
similarities = model.similarity(embeddings, embeddings)

# match the least similar title to each positive match as its negative match
similarities_argsorted = np.argsort(similarities.numpy(), axis=1)
negative_pair_index_list = []

for i in range(len(similarities)):

    # Start with the smallest similarity index for the current row
    j = 0
    index = int(similarities_argsorted[i][j])

    # Ensure the index is unique
    while index in negative_pair_index_list:
        j += 1 # Move to the next smallest index
        index = int(similarities_argsorted[i][j]) # Fetch next smallest index

    negative_pair_index_list.append(index)

# add negative pairs to df
df['title_neg'] = df['title'].iloc[negative_pair_index_list].values

Finally, I created a train-valid-test split and pushed the dataset to the Hugging Face Hub.

# Shuffle the dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train, validation, and test sets
train_frac = 0.7
valid_frac = 0.15
test_frac = 0.15

# define train and validation size
train_size = int(train_frac * len(df))
valid_size = int(valid_frac * len(df))

# create train, validation, and test datasets
df_train = df[:train_size]
df_valid = df[train_size:train_size + valid_size]
df_test = df[train_size + valid_size:]

# Convert the pandas DataFrames back to Hugging Face Datasets
train_ds = Dataset.from_pandas(df_train)
valid_ds = Dataset.from_pandas(df_valid)
test_ds = Dataset.from_pandas(df_test)

# Combine into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_ds,
    'valid': valid_ds,
    'test': test_ds
})

# push data to hub
dataset_dict.push_to_hub("shawhin/yt-title-thumbnail-pairs")

Although we have all the data we need for fine-tuning, it is not yet in the right format for training. Specifically, we need to convert our image URLs into PIL image objects and organize the data into (anchor, positive, negative) triplets, i.e. a thumbnail, its corresponding title, and a negative title, respectively.

We can process all three data splits (i.e. train, valid, and test) in the following way using the Hugging Face Datasets library.

from PIL import Image
from datasets import load_dataset

# load dataset
dataset = load_dataset("shawhin/yt-title-thumbnail-pairs")

# define preprocessing function
def preprocess(batch):
    """
    Preprocessing data without augmentations for test set
    """
    # get images from urls
    image_list = [Image.open(requests.get(url, stream=True).raw)
                  for url in batch["thumbnail_url"]]

    # return columns with standard names
    return {
        "anchor": image_list,
        "positive": batch["title"],
        "negative": batch["title_neg"]
    }

# remove columns not relevant to training
columns_to_remove = [col for col in dataset['train'].column_names
                     if col not in ['anchor', 'positive', 'negative']]
# apply transformations
dataset = dataset.map(preprocess, batched=True,
                      remove_columns=columns_to_remove)

It's important that we order our columns as (anchor, positive, negative) triplets because this is the format expected by the loss function we will use during training (which I learned the hard way).
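A quick way to double-check this is to print the column names, and, if the order is off, reorder them explicitly. This sketch assumes a recent version of the datasets library whose select_columns keeps the requested order:

# sanity check: the trainer feeds columns to the loss in this order
print(dataset["train"].column_names)  # expect ['anchor', 'positive', 'negative']

# reorder explicitly if needed (assumes select_columns preserves the given order)
dataset = dataset.select_columns(["anchor", "positive", "negative"])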

Training involves optimizing a model's parameters to minimize a loss function. However, this value (i.e. a contrastive loss) is rarely helpful for assessing the model's performance on a downstream task (e.g. matching titles to thumbnails).

A more insightful quantity, in this case, is the model's ability to correctly match a given thumbnail to its title among several candidates. This is denoted Recall@1.

We can implement an evaluator compatible with the Sentence Transformers library to compute this metric. Since the code is quite long, I won't include it here, but the curious reader can find it in Cell 12 of this notebook.
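For intuition only, here is a minimal sketch of how a Recall@1-style score could be computed directly from the model's embeddings. This is not the notebook's ImageTextRetrievalEvaluator (which also handles arbitrary k and result naming); it just relies on the fact that Sentence Transformers CLIP models can encode both PIL images and text.

import numpy as np

def recall_at_1(model, images, texts):
    """Fraction of thumbnails whose most similar title is their own title."""
    img_emb = model.encode(images)   # embed the thumbnails
    txt_emb = model.encode(texts)    # embed the candidate titles
    sims = model.similarity(img_emb, txt_emb).numpy()  # (n_images, n_titles)

    # for each thumbnail, check whether the top-scoring title is the correct one
    top1 = sims.argmax(axis=1)
    return float(np.mean(top1 == np.arange(len(images))))

# example usage (hypothetical):
# recall_at_1(model, dataset["valid"]["anchor"], dataset["valid"]["positive"])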

# function to create new evaluator given data split
def create_recall_evaluator(set_name, k=1):
"""
Create triplet evaluator for "train", "valid", or "test" split
"""

return ImageTextRetrievalEvaluator(
images=dataset[f"{set_name}"]["anchor"],
texts=dataset[f"{set_name}"]["positive"],
name=f"yt-title-thumbnail-{set_name}",
k=k
)

# Create new evaluator with Recall@k
evaluator_recall_train = create_recall_evaluator("train", k=1)
evaluator_recall_valid = create_recall_evaluator("valid", k=1)

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))

# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.660377358490566}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}

We can see that the model already has decent performance out of the box, with the correct title being matched 66% of the time.

There are 3 key things we must do before training the model: namely, choose which parameters to train, pick a loss function, and set the hyperparameters.

Trainable Parameters

The key limitation of this project is that I've only posted 76 YouTube videos (as of writing this). With the validation and test splits, that leaves only 53 examples for training.

With so few training examples, limiting the number of parameters we train is a good idea. In this case, I only train the final projection layer of the model, which maps the text and image embeddings into a shared space. This is about 1M parameters in total.

# import model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

# pick specific layers to train (note: you can add more layers to this list)
trainable_layers_list = ['projection']

# Apply freezing configuration
for name, param in model.named_parameters():

    # freeze all params
    param.requires_grad = False

    # unfreeze layers in trainable_layers_list
    if any(layer in name for layer in trainable_layers_list):
        param.requires_grad = True

# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"% of trainable parameters: {100*trainable_params/total_params:.2f}%")

# >> Total parameters: 427,616,513
# >> Trainable parameters: 1,376,256
# >> % of trainable parameters: 0.32%

Loss Function

Here, I use the Multiple Negatives Ranking Loss from the Sentence Transformers library (which works with single negatives like ours). It works by maximizing the similarity between positive pairs while minimizing the similarity between negative pairs. Here's what the loss function looks like for the single-negative case [2].

Multiple Negatives Ranking Loss (with just 1 negative). Image by author.
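Since the figure itself isn't reproduced here, a rough LaTeX sketch of the single-negative form is given below, for one anchor with one explicit negative (ignoring the additional in-batch negatives the library also uses). It assumes cosine similarity sim(·,·) and the library's default scale factor s, so the notation may differ slightly from the figure:

\mathcal{L} = -\log \frac{\exp\big(s \cdot \mathrm{sim}(a, p)\big)}{\exp\big(s \cdot \mathrm{sim}(a, p)\big) + \exp\big(s \cdot \mathrm{sim}(a, n)\big)}

where a is the anchor (thumbnail) embedding, p is the positive (matching title) embedding, and n is the negative title embedding.
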
from sentence_transformers.losses import MultipleNegativesRankingLoss

# define loss
loss = MultipleNegativesRankingLoss(model)

Hyperparameters

For the hyperparameters, I experimented with a handful of choices by hand and picked the one with the best validation loss and Recall@1 performance. Here are the final choices.

from sentence_transformers import SentenceTransformerTrainingArguments

# hyperparameters
num_epochs = 2
batch_size = 16
lr = 1e-4
finetuned_model_name = "clip-title-thumbnail-embeddings"

train_args = SentenceTransformerTrainingArguments(
    output_dir=f"models/{finetuned_model_name}",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=lr,
    # Evaluation settings
    eval_strategy="epoch",
    eval_steps=1,
    logging_steps=1,
)

With our loss and hyperparameters defined, we can train the model using the SentenceTransformerTrainer().

from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
trainer.train()

Model training is an iterative process where you may explore dozens of models for different choices of trainable parameters, loss functions, and hyperparameters.

However, I recommend keeping these experiments as simple as possible. If you find yourself spending too much time tweaking training args to get your model to converge, there's probably something fundamentally wrong with your data (speaking from experience).

As a final step, we can evaluate the model's Recall@1 score on the test set. These data were not used for training or hyperparameter tuning, so they give us an unbiased assessment of the model.

evaluator_recall_test = create_recall_evaluator("test")

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))
print("Test:", evaluator_recall_test(model))

# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.8490566037735849}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.9090909090909091}
# >> Test: {'yt-title-thumbnail-test_Recall@1': 0.75}

We see that the model performs well across all three datasets, with 75% Recall@1 on the test set. In other words, 75% of the time, the model correctly matches a given thumbnail to its original title. Additionally, the recall on the validation data increases by 27%!

Multimodal embedding models, such as CLIP, unlock countless 0-shot use cases. Here, we saw how we can fine-tune such a model to adapt it to a specialized domain (i.e. my YouTube titles and thumbnails).

Although CLIP is a small model by today's standards (~500M parameters) and our training dataset was tiny, the final model still demonstrated strong performance on this task. This highlights the power of fine-tuning.
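As a closing illustration, here's a hedged sketch of how the fine-tuned model could be used to rank candidate titles for a new thumbnail. The URL and titles below are placeholders, and (as above) this relies on the Sentence Transformers CLIP model encoding both PIL images and text:

from PIL import Image
import requests

# placeholder thumbnail URL and candidate titles (hypothetical)
thumbnail = Image.open(requests.get("https://example.com/thumbnail.jpg", stream=True).raw)
candidate_titles = ["Fine-Tuning Multimodal Embedding Models",
                    "A Beginner's Guide to Sourdough"]

# embed the thumbnail and the titles with the fine-tuned model
img_emb = model.encode([thumbnail])
txt_emb = model.encode(candidate_titles)

# pick the title most similar to the thumbnail
scores = model.similarity(img_emb, txt_emb)   # shape (1, n_titles)
print(candidate_titles[int(scores.argmax())])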

If you have any questions or suggestions for future content, let me know in the comments 🙂

More on multimodal AI 👇
