Introducing Gemini Embedding 2 Preview | About Data Science

Google has released a preview version of its latest embedding model. This model is notable for one main reason: it can embed text, PDFs, images, audio, and video, making it a one-stop shop for embedding almost anything you'd like to throw at it.
If you're new to embeddings, you might be wondering what all the fuss is about, but embedding turns out to be one of the cornerstones of retrieval-augmented generation, or RAG as it's known. And RAG, in turn, is one of the most important applications of modern artificial intelligence.
A quick summary of RAG and embedding
RAG is a method of encoding and storing searchable information so that search terms can be matched against it. The encoding step turns source content into a series of numbers called a vector; this is what embedding does. The vectors (embeddings) are then stored in a vector database.
When a user enters a search term, it too is encoded as an embedding, and the resulting vector is compared to the contents of the vector database, usually with a measure called cosine similarity. The closer the search-term vector is to a stored vector, the more relevant that part of the database is to the search. A large language model can then take the most relevant parts and use them to compose an answer for the user.
There are a number of other factors surrounding this, such as how the input data should be split into chunks, but embedding, storage, and retrieval are the main aspects of RAG processing. To help you visualize, here is a simplified schematic of the RAG process.
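The comparison step above can be sketched in a few lines of NumPy. The three-dimensional "embeddings" here are made up purely for illustration; real models produce vectors with hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "embeddings" for illustration only.
query = np.array([0.9, 0.1, 0.0])
doc_a = np.array([0.8, 0.2, 0.1])   # points in a similar direction to the query
doc_b = np.array([0.0, 0.1, 0.9])   # points a very different way

print(cosine_similarity(query, doc_a))  # close to 1.0 -> strong match
print(cosine_similarity(query, doc_b))  # close to 0.0 -> weak match
```

A score near 1.0 means the vectors point the same way (semantically similar); a score near 0.0 means they are unrelated.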
So, what's so special about Gemini embedding?
Okay, now that we know how important embedding is to RAG, why is Google's new Gemini embedding model such a big deal? That's easy. Traditional embedding models, with a few exceptions, are limited to text, PDFs, and other document types, and perhaps images at a push.
What Gemini now offers is truly multimodal embedding: text, PDFs and documents, images, audio, and video. As this is a preview model, there are some size limitations on the input right now, but hopefully you can see the direction of travel and how useful this can be.
Input limits
I mentioned that there are limits on what we can feed the new Gemini embedding model. Here they are:
- Text: up to 8,192 input tokens, roughly 6,000 words
- Images: up to 6 images per request, PNG and JPEG formats
- Video: up to 2 minutes, MP4 and MOV formats
- Audio: up to 80 seconds, MP3 and WAV formats
- Documents: up to 6 pages
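It can be worth checking these limits locally before calling the API, so an oversized input fails fast with a clear message rather than erroring remotely. Here is a hypothetical pre-flight check mirroring the preview limits listed above (the `LIMITS` names and the helper are my own, not part of the SDK):

```python
# Hypothetical pre-flight limits, mirroring the preview restrictions above.
LIMITS = {
    "text_tokens": 8192,
    "images_per_request": 6,
    "video_seconds": 120,
    "audio_seconds": 80,
    "document_pages": 6,
}

def check_audio_length(duration_seconds: float) -> None:
    """Raise early if an audio clip exceeds the preview limit."""
    if duration_seconds > LIMITS["audio_seconds"]:
        raise ValueError(
            f"Audio is {duration_seconds:.0f}s; the preview limit is "
            f"{LIMITS['audio_seconds']}s. Split it into shorter clips first."
        )

check_audio_length(37)  # our 37-second fishing clip later on is fine
```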
Okay, time to see the new embedding model in action with some Python code examples.
Setting up the development environment
To start, let's set up a separate development environment to keep our project's dependencies isolated. I'll use the uv tool for this, but feel free to use any method you're familiar with.
$ uv init embed-test --python 3.13
$ cd embed-test
$ uv venv
$ source .venv/bin/activate
$ uv add google-genai jupyter numpy scikit-learn pydub audioop-lts
# To run the notebook, type this in
$ uv run jupyter notebook
You'll also need a Gemini API key, which you can find on Google's AI Studio home page.
Look for the Get API Key link near the bottom left of the screen after signing in. Keep your key safe, as you'll need it shortly.
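Rather than hard-coding the key into a notebook, you may prefer to read it from an environment variable. The variable name `GEMINI_API_KEY` and the helper below are my own choices, a minimal sketch rather than anything the SDK requires:

```python
import os

def get_api_key() -> str:
    """Read the Gemini API key from an environment variable (name is my choice)."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY before running the examples.")
    return key

# Then, in the setup code:
# client = genai.Client(api_key=get_api_key())
```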
Please note, other than being a user of their products, I have no affiliation or association with Google or any of its subsidiaries.
Setup code
I won't talk too much about embedding text or PDF documents, as these are more specific and covered elsewhere. Instead, we'll look at embedding images and audio, which are less common.
This is the setup code, common to all our examples.
import os
import numpy as np
from pydub import AudioSegment
from google import genai
from google.genai import types
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display, Image as IPImage, Audio as IPAudio, Markdown
client = genai.Client(api_key='YOUR_API_KEY')
MODEL_ID = "gemini-embedding-2-preview"
Example 1 – Embedding images
In this example, we will embed 3 images: one of a ginger cat, one of a Labrador, and one of a yellow dolphin. We will then pose a series of questions or phrases, each specifically relating to one of the images, and see if the model can choose the most appropriate image for each. It does this by computing a cosine similarity score between each question and each image; the higher the score, the more relevant the image is to the question.
Here are the images I use.

So, here are my two questions and two phrases.
- Which animal is yellow
- Which is most likely called Rover
- There's something fishy going on here
- A purrrfect image
# Some helper functions
#
# Embed text
def embed_text(text: str) -> np.ndarray:
    """Encode a text string into an embedding vector.

    Simply pass the string directly to embed_content.
    """
    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[text],
    )
    return np.array(result.embeddings[0].values)
# Embed an image
def embed_image(image_path: str) -> np.ndarray:
    # Determine MIME type from extension
    ext = image_path.lower().rsplit('.', 1)[-1]
    mime_map = {'png': 'image/png', 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg'}
    mime_type = mime_map.get(ext, 'image/png')
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
        ],
    )
    return np.array(result.embeddings[0].values)
# --- Define image files ---
image_files = ["dog.png", "cat.png", "dolphin.png"]
image_labels = ["dog", "cat", "dolphin"]

# Our questions
text_descriptions = [
    "Which animal is yellow",
    "Which is most likely called Rover",
    "There's something fishy going on here",
    "A purrrfect image"
]

# --- Compute embeddings ---
print("Embedding texts...")
text_embeddings = np.array([embed_text(t) for t in text_descriptions])
print("Embedding images...")
image_embeddings = np.array([embed_image(f) for f in image_files])

# Use cosine similarity for matches
text_image_sim = cosine_similarity(text_embeddings, image_embeddings)

# Print best matches for each text
print("\nBest image match for each text:")
for i, text in enumerate(text_descriptions):
    # np.argmax looks across the row (i) to find the highest score among the columns
    best_idx = np.argmax(text_image_sim[i, :])
    best_image = image_labels[best_idx]
    best_score = text_image_sim[i, best_idx]
    print(f'"{text}" => {best_image} (score: {best_score:.3f})')
Here is the output.
Embedding texts...
Embedding images...
Best image match for each text:
"Which animal is yellow" => dolphin (score: 0.399)
"Which is most likely called Rover" => dog (score: 0.357)
"There's something fishy going on here" => dolphin (score: 0.302)
"A purrrfect image" => cat (score: 0.368)
Not too shabby. The model came up with the same answers I would have given. What about you?
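If you want to see the full ranking for each question rather than just the top match, `np.argsort` does the job. Here is a sketch using a made-up similarity matrix standing in for `text_image_sim` (the numbers are illustrative, not real model output):

```python
import numpy as np

# Made-up text-vs-image similarity scores: rows are questions,
# columns are [dog, cat, dolphin]. Illustrative values only.
labels = ["dog", "cat", "dolphin"]
sim = np.array([
    [0.210, 0.250, 0.399],   # "Which animal is yellow"
    [0.357, 0.240, 0.220],   # "Which is most likely called Rover"
])

rankings = []
for row in sim:
    # np.argsort sorts ascending, so reverse it for best-first order
    rankings.append([labels[j] for j in np.argsort(row)[::-1]])

print(rankings)  # [['dolphin', 'cat', 'dog'], ['dog', 'cat', 'dolphin']]
```

Inspecting the runner-up scores is a good sanity check: if first and second place are nearly tied, the match is less trustworthy than the top score alone suggests.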
Example 2 — Embedding sound
For the audio, I used a recording of a man describing a fishing trip where he saw a bright yellow dolphin. The clip is about 37 seconds long.
If you don't want to listen, here is the full text.
Hello, my name is Glen, and I want to tell you about an interesting sight I saw last Tuesday afternoon while fishing in the sea with some friends. It was a warm day with a yellow sun in the sky. We were fishing for Tuna and had no luck catching anything. Boy, we must have spent the better part of 5 hours there. Therefore, we became very bitter as we returned to dry land. But then, suddenly, and I swear it's not a lie, we saw a school of dolphins. Not only that, but one of them was bright yellow in color. We have never seen anything like this in our lives, but I can tell you all thoughts of a bad day of fishing went out the window. It was exciting.
Now, let's see if we can narrow down when the speaker talks about seeing a yellow dolphin.
Generally, when dealing with embeddings, we are only interested in the general properties, ideas, and concepts contained in the source information. If we want to pin down specific features, such as where in an audio file a certain phrase appears or where in a video a certain event occurs, that is a more complicated task. To do it in our example, we must first split the audio into small chunks and embed each chunk individually. We then run the same similarity search against each embedded chunk before producing our final answer.
# --- HELPER FUNCTIONS ---
def embed_text(text: str) -> np.ndarray:
    result = client.models.embed_content(model=MODEL_ID, contents=[text])
    return np.array(result.embeddings[0].values)

def embed_audio(audio_path: str) -> np.ndarray:
    ext = audio_path.lower().rsplit('.', 1)[-1]
    mime_map = {'wav': 'audio/wav', 'mp3': 'audio/mp3'}
    mime_type = mime_map.get(ext, 'audio/wav')
    with open(audio_path, 'rb') as f:
        audio_bytes = f.read()
    result = client.models.embed_content(
        model=MODEL_ID,
        contents=[types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)],
    )
    return np.array(result.embeddings[0].values)
# --- MAIN SEARCH SCRIPT ---
def search_audio_with_embeddings(audio_file_path: str, search_phrase: str, chunk_seconds: int = 5):
    print(f"Loading {audio_file_path}...")
    audio = AudioSegment.from_file(audio_file_path)

    # pydub works in milliseconds, so 5 seconds = 5000 ms
    chunk_length_ms = chunk_seconds * 1000
    audio_embeddings = []
    temp_files = []

    print(f"Slicing audio into {chunk_seconds}-second pieces...")
    # 2. Chop the audio into pieces
    # We use a loop to jump forward by chunk_length_ms each time
    for i, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
        # Extract the slice
        chunk = audio[start_ms:start_ms + chunk_length_ms]
        # Save it temporarily to your folder so the Gemini API can read it
        chunk_name = f"temp_chunk_{i}.wav"
        chunk.export(chunk_name, format="wav")
        temp_files.append(chunk_name)
        # 3. Embed this specific chunk
        print(f"  Embedding chunk {i + 1}...")
        emb = embed_audio(chunk_name)
        audio_embeddings.append(emb)
    audio_embeddings = np.array(audio_embeddings)

    # Tidy up the temporary chunk files
    for name in temp_files:
        os.remove(name)

    # 4. Embed the search text
    print(f"\nEmbedding your search: '{search_phrase}'...")
    text_emb = np.array([embed_text(search_phrase)])

    # 5. Compare the text against all the audio chunks
    print("Calculating similarities...")
    sim_scores = cosine_similarity(text_emb, audio_embeddings)[0]

    # Find the chunk with the highest score
    best_chunk_idx = np.argmax(sim_scores)
    best_score = sim_scores[best_chunk_idx]

    # Calculate the timestamp
    start_time = best_chunk_idx * chunk_seconds
    end_time = start_time + chunk_seconds

    print("\n--- Results ---")
    print(f"The concept '{search_phrase}' most closely matches the audio between {start_time}s and {end_time}s!")
    print(f"Confidence score: {best_score:.3f}")

# --- RUN IT ---
# Replace with whatever phrase you are looking for!
search_audio_with_embeddings("fishing2.mp3", "yellow dolphin", chunk_seconds=5)
Here is the output.
Loading fishing2.mp3...
Slicing audio into 5-second pieces...
Embedding chunk 1...
Embedding chunk 2...
Embedding chunk 3...
Embedding chunk 4...
Embedding chunk 5...
Embedding chunk 6...
Embedding chunk 7...
Embedding chunk 8...
Embedding your search: 'yellow dolphin'...
Calculating similarities...
--- Results ---
The concept 'yellow dolphin' most closely matches the audio between 25s and 30s!
Confidence score: 0.643
That is quite accurate. If you listen to the audio again, the word "dolphin" is said at the 25-second mark and "bright yellow" at the 29-second mark. Earlier in the audio, I deliberately included the phrase "yellow sun" to see if the model would get confused, but it handled the distraction well.
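One refinement worth considering: a phrase that straddles a 5-second boundary can be split across two chunks and match neither one strongly. Overlapping the chunks avoids this. Here is a minimal sketch of the chunk arithmetic (the `chunk_spans` helper is my own, not part of the script above):

```python
def chunk_spans(total_seconds: float, chunk_seconds: float, overlap_seconds: float):
    """Yield (start, end) spans with overlap, so a phrase straddling a
    chunk boundary still lands wholly inside at least one chunk.
    Assumes overlap_seconds < chunk_seconds."""
    step = chunk_seconds - overlap_seconds
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + chunk_seconds, total_seconds))
        start += step

# 12 seconds of audio, 5-second chunks, 2 seconds of overlap
print(list(chunk_spans(12, 5, 2)))
# [(0.0, 5.0), (3.0, 8.0), (6.0, 11.0), (9.0, 12)]
```

The trade-off is more chunks and therefore more embedding calls, so overlap is best reserved for cases where boundary effects actually matter.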
Summary
This article previews Gemini Embedding 2, Google's new all-in-one embedding model for text, PDFs, images, audio, and video. It explains why that matters for RAG systems, where embeddings transform content and search queries into vectors that can be compared for similarity.
I then walked through two Python examples showing how to embed images and audio with the Google GenAI SDK: using cosine similarity to match text queries to images, and chunking audio into small segments to identify the part of a spoken recording closest to a given search phrase.
The opportunity to perform semantic search beyond text and documents is a real boon. Google's new embedding model promises to open up a host of new possibilities for multimodal search, retrieval, and recommendation systems, making it much easier to work with images, audio, video, and documents in one unified way. As the model matures, it could become an effective foundation for rich RAG applications that understand more than text alone.
You can find Google's original blog post announcing the embedding model using the link below.



