The NVIDIA Nemotron 3 Nano Omni model is now available on Amazon SageMaker JumpStart

Today, we are excited to announce the day-zero availability of NVIDIA Nemotron 3 Nano Omni in Amazon SageMaker JumpStart. This multimodal model from NVIDIA combines video, audio, image, and text understanding into a single, efficient architecture, so business customers can build intelligent applications that see, hear, and reason across modalities in a single model call.
In this post, we walk through the architecture and key capabilities of the Nemotron 3 Nano Omni, explore the business use cases it opens up, and show you how to deploy it and run inference using Amazon SageMaker JumpStart.
NVIDIA Nemotron 3 Nano Omni Overview
NVIDIA Nemotron 3 Nano Omni is an open multimodal language model with 30 billion total parameters and 3 billion active parameters (30B A3B). It is built on a hybrid Mamba2-Transformer Mixture of Experts (MoE) architecture with three main components:
- Nemotron 3 Nano LLM as the language backbone
- CRADIO v4-H as the vision encoder for image and video understanding
- Parakeet as the speech encoder for audio transcription and understanding
This unified architecture takes video, audio, images, and text as input and produces text as output. It supports a 131K token context length, reasoning, tool calling, structured JSON output, and word-level timestamps for transcription. The model is available in FP8 precision in SageMaker JumpStart, offering a strong balance of accuracy and efficiency for business workloads, and is licensed under the NVIDIA Open Model License.
Agents must interpret screens, documents, audio, video, and text, often within the same reasoning loop. Today, many agentic systems stitch together separate models for perception, speech, and language. This approach adds latency from multiple inference hops, complicates debugging and error handling, fragments context across pipelines, and increases cost and failure modes over time.
The Nemotron 3 Nano Omni addresses this by acting as the multimodal perception and reasoning component in an agentic system. It gives the agent system eyes and ears: reading screens, interpreting documents, transcribing speech, and analyzing video, all while maintaining a unified multimodal context across the reasoning loop. Nano Omni understands screens, documents, audio, and video in one reasoning loop, replacing separate model stacks and greatly simplifying workflow design. For anyone building an agent architecture, this collapses the inference hops, orchestration logic, and synchronization across different models into a single model call. The model accepts the following input types:
| Input type | Supported formats | Constraints |
|---|---|---|
| Video | mp4 | Up to 2 minutes, up to 256 frames |
| Audio | wav, mp3 | Up to 1 hour, 8 kHz+ sample rate |
| Image | JPEG, PNG (RGB) | Standard support |
| Text | String | Up to 131K token context |
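To illustrate how these modalities come together in a single request, the following is a minimal, hypothetical payload sketch that combines an image, an audio clip, and a text instruction in one message. It assumes the endpoint accepts mixed content parts in a single request, and it uses the same OpenAI-style messages format shown in the inference examples later in this post; the file names are placeholders.

import base64

def to_b64(path):
    # Read a local file and return its base64-encoded contents
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical payload combining an image, an audio clip, and text in one message
payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{to_b64('dashboard.jpg')}"}},
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{to_b64('voice_note.wav')}"}},
            {"type": "text",
             "text": "Does the spoken request match what is shown on screen?"},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}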
Business use cases
The multimodal capabilities of the Nemotron 3 Nano Omni make it a powerful, flexible model for business use cases.
Computer use agents
The Nemotron 3 Nano Omni enables vision in the loop for agents navigating graphical interfaces. It reads screens, understands UI state over time, and validates results, while execution agents handle the actions. This folds perception and reasoning into a single loop, eliminating the need for separate perception pipelines. Practical applications include incident management dashboards, agentic search, browser automation, and email workflow agents.
Document intelligence
The model interprets documents, charts, tables, screenshots, and mixed-media input, allowing agents to reason over visual structure and textual content together. This is important for business analysis and compliance workflows involving contracts, statements of work, financial documents, and scientific papers.
Audio and video understanding agents
For customer service, research, and workflow monitoring, the Nemotron 3 Nano Omni maintains continuous audio and video context. It ties together what is said, shown, and written into a single reasoning stream instead of disconnected summaries. This enables applications such as meeting recording analysis, media and entertainment asset management, drive-thru order verification, and customer service video review (for example, confirming package delivery to a specific address via OCR).
Getting started with SageMaker JumpStart
You can deploy the Nemotron 3 Nano Omni with Amazon SageMaker JumpStart in a few steps. SageMaker JumpStart provides one-click deployment of foundation models through optimized inference containers, eliminating the need to manage infrastructure, configure deployment frameworks, or handle model artifact downloads.
Prerequisites
Before you begin, make sure you have an AWS account with access to Amazon SageMaker AI, an AWS Identity and Access Management (IAM) role with permissions to create and invoke SageMaker endpoints, and sufficient service quota for the instance type you plan to use.
Deploy using SageMaker Studio
- Open Amazon SageMaker Studio
- In the left navigation pane, choose JumpStart
- Search for Nemotron 3 Nano Omni
- Select the model card and choose Deploy
- Configure your instance type and deployment settings
- Choose Deploy to create an endpoint
Deploy using the SageMaker Python SDK
You can also deploy programmatically using the SageMaker Python SDK:
from sagemaker.jumpstart.model import JumpStartModel

# Create the JumpStart model object; provide an IAM role with SageMaker permissions
model = JumpStartModel(
    model_id="huggingface-vlm-nvidia-nemotron3-nano-omni-30ba3b-reasoning-fp8",
    role="",
)

# Deploy the model to a real-time endpoint (accepting the EULA is required)
predictor = model.deploy(
    accept_eula=True,
)
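If you deploy from a notebook and later need to send requests from a different session, you can attach to the already-running endpoint instead of redeploying. The following is a minimal sketch; the endpoint name is a hypothetical placeholder, so substitute the name of the endpoint created above.

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Attach to an existing endpoint (hypothetical name; use your actual endpoint name)
predictor = Predictor(
    endpoint_name="nemotron3-nano-omni-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)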
Run inference: Image understanding
Once the endpoint is deployed, you can send multimodal requests to it. The following example shows how to send an image understanding request:
import base64

def encode_image(image_path):
    # Read the image and return its base64-encoded contents
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("example.jpg")

# Build an OpenAI-style chat payload with the image passed as a data URL
payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
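If you're invoking the endpoint from an application that doesn't use the SageMaker Python SDK, you can call it directly through the SageMaker Runtime API. The following is a minimal sketch that reuses the payload above; the endpoint name is a hypothetical placeholder.

import json
import boto3

# Invoke the endpoint through the SageMaker Runtime API (hypothetical endpoint name)
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="nemotron3-nano-omni-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result["choices"][0]["message"]["content"])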
Run inference: Video understanding with reasoning
The following example sends a short video and asks the model to summarize it, using the sampling parameters recommended for reasoning mode:
import base64

def encode_video(video_path):
    # Read the video and return its base64-encoded contents
    with open(video_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

video_b64 = encode_video("meeting_recording.mp4")

# Reasoning-mode request: higher max_tokens, temperature 0.6, top_p 0.95
payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            {"type": "text",
             "text": "Summarize the key discussion points."},
        ],
    }],
    "max_tokens": 20480,
    "temperature": 0.6,
    "top_p": 0.95,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
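Base64-encoded video can be large, and SageMaker real-time endpoints enforce a maximum request payload size. As a rough guard, you can check the encoded request size before sending it; the 6 MB threshold below is an illustrative assumption, so check the current SageMaker quotas for your account and Region.

import json

# Illustrative payload-size guard; adjust the limit to the current SageMaker quota
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024

payload_bytes = len(json.dumps(payload).encode("utf-8"))
if payload_bytes > MAX_PAYLOAD_BYTES:
    raise ValueError(
        f"Request payload is {payload_bytes} bytes, which exceeds the "
        f"{MAX_PAYLOAD_BYTES}-byte limit; trim the video or lower its resolution."
    )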
Run inference: Audio transcription
The following example sends an audio recording for transcription and analysis:
import base64

def encode_audio(audio_path):
    # Read the audio file and return its base64-encoded contents
    with open(audio_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_b64 = encode_audio("customer_call.wav")

# Transcription request using the parameters recommended for instruct-style tasks
payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text",
             "text": "Transcribe this audio and identify key action items."},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
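The three examples above follow the same pattern, so you can factor the request construction into a small helper. The following is a minimal sketch; the function name and defaults are illustrative and not part of the SageMaker or NVIDIA APIs.

import base64

def multimodal_request(predictor, prompt, media_path=None, mime_type=None,
                       max_tokens=1024, temperature=0.2):
    # Build the content list: an optional media part (as a data URL) plus the text prompt
    content = []
    if media_path and mime_type:
        with open(media_path, "rb") as f:
            media_b64 = base64.b64encode(f.read()).decode("utf-8")
        kind = mime_type.split("/")[0]  # "image", "audio", or "video"
        content.append({
            "type": f"{kind}_url",
            f"{kind}_url": {"url": f"data:{mime_type};base64,{media_b64}"},
        })
    content.append({"type": "text", "text": prompt})

    payload = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    response = predictor.predict(payload)
    return response["choices"][0]["message"]["content"]

# Example usage
print(multimodal_request(predictor, "Describe this image.", "example.jpg", "image/jpeg"))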
Recommended inference parameters
The following table contains recommended sampling parameter values for the Nemotron 3 Nano Omni. The values differ depending on the inference mode.
| Mode | temperature | top_p | max_tokens | Use case |
|---|---|---|---|---|
| Reasoning | 0.6 | 0.95 | 20480 | Complex reasoning |
| Instruct | 0.2 | N/A | 1024 | Standard tasks, ASR |
For tasks that involve reasoning and complex understanding, we recommend enabling reasoning mode. For transcription and focused tasks, instruct mode (with reasoning disabled) provides faster responses.
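To keep these settings consistent across calls, you can capture the two modes as parameter presets and merge the appropriate one into each payload. The preset names below are illustrative; the values come from the table above.

# Illustrative parameter presets based on the recommendations above
REASONING_PARAMS = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480}
INSTRUCT_PARAMS = {"temperature": 0.2, "max_tokens": 1024}

def with_params(messages, params):
    # Merge a parameter preset into a chat payload
    return {"messages": messages, **params}

payload = with_params(
    [{"role": "user", "content": [{"type": "text", "text": "Explain your reasoning."}]}],
    REASONING_PARAMS,
)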
Clean up
To avoid incurring unnecessary costs, delete the SageMaker endpoint when you're done:
predictor.delete_endpoint()
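If you created the model in the same session, you can also remove the associated model resource:
predictor.delete_model()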
Conclusion
NVIDIA Nemotron 3 Nano Omni brings a new level of multimodal intelligence to Amazon SageMaker JumpStart. By combining video, audio, image, and text understanding into one efficient model, it simplifies the development of enterprise agent applications while delivering up to 9x higher throughput and leading accuracy compared to other open omni models.
Whether you're building computer-use agents that navigate GUIs, document intelligence pipelines that automate workflows, or audio and video analytics systems for customer service, the Nemotron 3 Nano Omni provides the perception layer your agents need in a single model call.
Get started today by deploying the Nemotron 3 Nano Omni from Amazon SageMaker JumpStart. For more information about the model, visit the NVIDIA Nemotron model page on Hugging Face.
About the authors



