
Prompting Vision Language Models: Exploring techniques to prompt VLMs | by Anand Subramanian | Jan, 2025

Typically, object detection models are trained with a fixed vocabulary, meaning they can only recognize a predefined set of object categories. However, in our pipeline, since we cannot predict in advance which objects will appear in the image, we need a detector that can recognize a wide variety of object classes. To achieve this, I use the OWL-ViT model [11], an open-vocabulary object detection model. This model requires text prompts that describe the objects to be detected.

Another challenge is that we need a high-level idea of the objects in the image before we can use the OWL-ViT model, since it requires text prompts describing those objects. This is where VLMs come in. First, we pass the image to the VLM with a prompt asking it to identify the high-level objects in the image. These detected objects are then used as text prompts, along with the image, for the OWL-ViT model to generate detections. Next, we plot the detections as bounding boxes on the same image and pass this annotated image back to the VLM, prompting it to generate a caption. The code is partly adapted from [12].

# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

processor = AutoProcessor.from_pretrained("google/owlvit-base-patch32")
model = AutoModelForZeroShotObjectDetection.from_pretrained("google/owlvit-base-patch32")
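To see how OWL-ViT consumes text prompts, here is a minimal sketch that runs a single detection and uses the processor's built-in post-processing to rescale boxes. The image path and text queries are illustrative placeholders, not part of the pipeline above:

from PIL import Image
import torch

image = Image.open("test_image.jpg").convert("RGB")  # hypothetical example image
text_queries = ["An image of a dog", "An image of a bicycle"]  # example prompts

inputs = processor(text=text_queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the normalized predicted boxes back to the original image size (height, width)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{text_queries[int(label)]}: {score:.2f} at {box.tolist()}")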

I detect the objects in each image using the VLM:

IMAGE_QUALITY = "high"
system_prompt_object_detection = """You are provided with an image. You must identify all important objects in the image, and provide a standardized list of objects in the image.
Return your output as follows:
Output: object_1, object_2"""

user_prompt = "Extract the objects from the provided image:"

detected_objects = process_images_in_parallel(image_paths, system_prompt=system_prompt_object_detection, user_prompt=user_prompt, model = "gpt-4o-mini", few_shot_prompt= None, detail=IMAGE_QUALITY, max_workers=5)

detected_objects_cleaned = {}

for key, value in detected_objects.items():
    detected_objects_cleaned[key] = list(set([x.strip() for x in value.replace("Output: ", "").split(",")]))
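For reference, process_images_in_parallel is the helper introduced earlier in the article that fans these requests out across a thread pool. A minimal sketch of the kind of single-image call it wraps, assuming the official openai Python client (the function name and argument names here are illustrative, not the author's exact implementation):

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def caption_single_image(image_path, system_prompt, user_prompt, detail="high"):
    # Encode the local image as a base64 data URL for the vision-enabled chat API
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{b64_image}",
                            "detail": detail,
                        },
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content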

The detected objects are now passed as text prompts to the OWL-ViT model to obtain predictions for the images. I use a helper function that predicts the bounding boxes for an image and then plots them on the original image.

from PIL import Image, ImageDraw, ImageFont
import numpy as np
import torch

def detect_and_draw_bounding_boxes(
    image_path,
    text_queries,
    model,
    processor,
    output_path,
    score_threshold=0.2
):
    """
    Detect objects in an image and draw bounding boxes over the original image using PIL.

    Parameters:
    - image_path (str): Path to the image file.
    - text_queries (list of str): List of text queries to process.
    - model: Pretrained model to use for detection.
    - processor: Processor to preprocess image and text queries.
    - output_path (str): Path to save the output image with bounding boxes.
    - score_threshold (float): Threshold to filter out low-confidence predictions.

    Returns:
    - output_image_pil: A PIL Image object with bounding boxes and labels drawn.
    """
    img = Image.open(image_path).convert("RGB")
    orig_w, orig_h = img.size  # original width, height

    inputs = processor(
        text=text_queries,
        images=img,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to("cpu")

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    logits = torch.max(outputs["logits"][0], dim=-1)      # best text query per predicted box
    scores = torch.sigmoid(logits.values).cpu().numpy()   # convert logits to probabilities
    labels = logits.indices.cpu().numpy()                 # class (text query) indices
    boxes_norm = outputs["pred_boxes"][0].cpu().numpy()   # shape (num_boxes, 4), normalized cx, cy, w, h

    # Convert normalized (cx, cy, w, h) boxes to absolute (x1, y1, x2, y2) pixel coordinates
    converted_boxes = []
    for box in boxes_norm:
        cx, cy, w, h = box
        cx_abs = cx * orig_w
        cy_abs = cy * orig_h
        w_abs = w * orig_w
        h_abs = h * orig_h
        x1 = cx_abs - w_abs / 2.0
        y1 = cy_abs - h_abs / 2.0
        x2 = cx_abs + w_abs / 2.0
        y2 = cy_abs + h_abs / 2.0
        converted_boxes.append((x1, y1, x2, y2))

    draw = ImageDraw.Draw(img)

    for score, (x1, y1, x2, y2), label_idx in zip(scores, converted_boxes, labels):
        if score < score_threshold:
            continue

        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)

        label_text = text_queries[label_idx].replace("An image of ", "")
        text_str = f"{label_text}: {score:.2f}"

        # textbbox replaces the textsize API that was removed in Pillow 10
        text_bbox = draw.textbbox((0, 0), text_str)
        text_w = text_bbox[2] - text_bbox[0]
        text_h = text_bbox[3] - text_bbox[1]
        text_x, text_y = x1, max(0, y1 - text_h)  # place the label slightly above the box

        draw.rectangle(
            [text_x, text_y, text_x + text_w, text_y + text_h],
            fill="white"
        )
        draw.text((text_x, text_y), text_str, fill="red")

    img.save(output_path, "JPEG")

    return img

import os

os.makedirs("images_with_bounding_boxes", exist_ok=True)  # ensure the output directory exists

for key, value in tqdm(detected_objects_cleaned.items()):
    value = ["An image of " + x for x in value]
    detect_and_draw_bounding_boxes(key, value, model, processor, "images_with_bounding_boxes/" + key.split("/")[-1], score_threshold=0.15)

The images with the plotted bounding boxes are now passed to the VLM for captioning:

IMAGE_QUALITY = "high"
image_paths_obj_detected_guided = [x.replace("downloaded_images", "images_with_bounding_boxes") for x in image_paths]

system_prompt = """You are a helpful assistant that can analyze images and provide captions. You are provided with images that also contain bounding box annotations of the important objects in them, along with their labels.
Analyze the overall image and the provided bounding box information and provide an appropriate caption for the image."""

user_prompt = "Please analyze the following image:"

obj_det_zero_shot_high_quality_captions = process_images_in_parallel(image_paths_obj_detected_guided, system_prompt=system_prompt, user_prompt=user_prompt, model="gpt-4o-mini", few_shot_prompt=None, detail=IMAGE_QUALITY, max_workers=5)

Captions obtained from object detection guided prompting. Images in this figure by Josh Frenette and Alexander Zaytsev on Unsplash (plotted by the author)

In this case, given the relatively simple nature of the images we use, the location of the objects does not add much new information for the VLM. However, object detection guided prompting can be a powerful tool for more complex tasks, such as document understanding, where layout information can provide valuable additional context to the VLM. Additionally, semantic segmentation can be employed as a prompting technique by providing segmentation masks to the VLM.
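As a rough illustration of that last idea, here is a minimal sketch of overlaying a segmentation mask on an image before sending it to the VLM. The mask here is assumed to be a binary NumPy array produced by some segmentation model; the helper name and parameters are illustrative:

import numpy as np
from PIL import Image

def overlay_mask(image_path, mask, output_path, color=(255, 0, 0), alpha=0.4):
    """Blend a binary segmentation mask onto the image as a colored overlay."""
    img = Image.open(image_path).convert("RGB")
    mask = np.asarray(mask, dtype=bool)  # shape (H, W), True where the object is

    overlay = np.array(img, dtype=np.float32)
    # Tint only the masked pixels, keeping the rest of the image unchanged
    overlay[mask] = (1 - alpha) * overlay[mask] + alpha * np.array(color, dtype=np.float32)

    out = Image.fromarray(overlay.astype(np.uint8))
    out.save(output_path, "JPEG")
    return out

The resulting annotated image could then be passed to the VLM together with a prompt describing what the highlighted region represents, analogous to the bounding-box-guided prompting above.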

VLMs are a powerful tool in the arsenal of AI engineers for tackling a variety of problems that require a combination of vision and text skills. In this article, I explored prompting strategies in the context of VLMs to use these models effectively for tasks such as image captioning. This is by no means an exhaustive or complete list of prompting strategies. One thing that has become increasingly clear with the progress in GenAI is the unlimited potential for creative and novel ways to prompt and guide LLMs and VLMs in solving tasks.

[1] J. Chen, H. Guo, K. Yi, B. Li et al., 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 18009-18019, doi: 10.1109/CVPR52688.2022.01750.

[2] Luo, Z., Xi, Y., Zhang, R., & Ma, J. (2022). A frustratingly simple approach for end-to-end image captioning.

[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, et al. 2022. Flamingo: a visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 1723, 23716-23736.

[4] https://huggingface.co/blog/vision_language_pretraining

[5] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556-2565, Melbourne, Australia. Association for Computational Linguistics.

[6] https://platform.openai.com/docs/guides/vision

[7] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74-81, Barcelona, Spain. Association for Computational Linguistics.

[8]HTTPS:

[9] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.

[10] PTPS:

[11] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. 2022. Simple Open-Vocabulary Object Detection. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 728-755. https://doi.org/10.1007/978-3-031-20080-9_42

[12]HTTPS:

[13]
