The NVIDIA Nemotron 3 Nano Omni model is now available on Amazon SageMaker JumpStart

Today, we are excited to announce day-0 availability of NVIDIA Nemotron 3 Nano Omni on Amazon SageMaker JumpStart. This multimodal model from NVIDIA combines video, audio, image, and text understanding into a single, efficient architecture, allowing business customers to build intelligent applications that can see, hear, and reason across modalities in a single model call.

In this post, we walk through the architecture and key capabilities of Nemotron 3 Nano Omni, explore the business use cases it opens up, and show you how to deploy the model and run inference using Amazon SageMaker JumpStart.

NVIDIA Nemotron 3 Nano Omni Overview

NVIDIA Nemotron 3 Nano Omni is an open multimodal language model with 30 billion total parameters, of which 3 billion are active per token (30B A3B). It is built on a hybrid Mamba2-Transformer Mixture of Experts (MoE) architecture with three main components:

  1. Nemotron 3 Nano LLM as the language backbone
  2. CRADIO v4-H as the vision encoder for image and video understanding
  3. Parakeet as the speech encoder for audio transcription and understanding

This unified architecture accepts video, audio, image, and text as input and produces text as output. It supports a 131K-token context length, reasoning, tool calling, JSON-structured output, and word-level timestamps for transcription. The model is available in FP8 precision in SageMaker JumpStart, striking a practical balance between accuracy and efficiency for business workloads, and is licensed for use under the NVIDIA Open Model License.

Agents must interpret screens, documents, audio, video, and text, often within the same reasoning loop. Today, many agent systems stitch together separate models for perception, speech, and language. This approach adds latency from multiple inference hops, complicates debugging and error handling, fragments context across pipelines, and drives up both cost and failure rates over time.

Nemotron 3 Nano Omni addresses this by acting as the multimodal perception and reasoning component in a system of agents. It gives the agent system eyes and ears: reading screens, interpreting documents, transcribing speech, and analyzing video, all while maintaining a unified multimodal context across every reasoning loop. Because Nano Omni understands screens, documents, audio, and video in one reasoning loop, it replaces separate model stacks and greatly simplifies workflow design. For anyone building an agent architecture, this collapses inference hops, orchestration logic, and cross-model synchronization into a single model call. The model accepts the following input types:

Input type | Supported formats | Constraints
Video | MP4 | Up to 2 minutes, up to 256 frames
Audio | WAV, MP3 | Up to 1 hour, 8 kHz+ sample rate
Image | JPEG, PNG (RGB) | Standard support
Text | String | Up to 131K token context
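Each modality reaches the endpoint as a base64 data URL inside a chat-style content block, as the inference examples later in this post show. The following is a minimal sketch of a helper that maps a local file to the right block shape; it uses only the Python standard library and assumes the content-block format from those examples:

import base64
import mimetypes

def media_block(path):
  # Guess the MIME type from the file extension (per the table above)
  mime, _ = mimetypes.guess_type(path)
  if mime is None:
    raise ValueError(f"Cannot determine media type for {path}")
  # Encode the file contents as a base64 data URL
  with open(path, "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")
  data_url = f"data:{mime};base64,{b64}"
  # Wrap the data URL in the content-block shape used in the examples below
  if mime.startswith("image/"):
    return {"type": "image_url", "image_url": {"url": data_url}}
  if mime.startswith("video/"):
    return {"type": "video_url", "video_url": {"url": data_url}}
  if mime.startswith("audio/"):
    return {"type": "audio_url", "audio_url": {"url": data_url}}
  raise ValueError(f"Unsupported media type: {mime}")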

Business use cases

The multimodal capabilities of the Nemotron 3 Nano Omni make it a powerful, flexible model for business use cases.

Computer use agents

Nemotron 3 Nano Omni enables in-the-loop vision for agents navigating graphical interfaces. It reads screens, understands UI state over time, and validates results, while execution agents handle the actions. This folds perception and reasoning into a single loop, eliminating the need for separate perception pipelines. Practical applications include incident-management dashboards, agentic search, browser automation, and email workflow agents.

Document intelligence

The model interprets documents, charts, tables, screenshots, and mixed-media input, allowing agents to reason over visual structure and textual content in parallel. This matters for business analysis and compliance operations involving contracts, statements of work, financial documents, and scientific papers.

Audio and video understanding agents

For customer service, research, and workflow monitoring, Nemotron 3 Nano Omni maintains continuous audio and video context. It ties together what is said, shown, and written into a single stream of reasoning instead of disconnected summaries. This enables applications such as meeting-recording analysis, media and entertainment asset management, drive-thru order verification, and customer service video review (for example, confirming package delivery to a specific address via OCR).

Getting started with SageMaker JumpStart

You can deploy Nemotron 3 Nano Omni with Amazon SageMaker JumpStart in a few steps. SageMaker JumpStart provides one-click deployment of foundation models through optimized inference containers, eliminating the need to manage infrastructure, configure serving frameworks, or handle model artifact downloads.

Prerequisites

Before you begin, make sure you have:

  1. An AWS account with access to Amazon SageMaker
  2. An IAM role with permissions to create and manage SageMaker resources
  3. Sufficient service quota for the GPU instance type recommended on the model card

Deploy using SageMaker Studio

  1. Open Amazon SageMaker Studio
  2. In the left navigation pane, choose JumpStart
  3. Search for Nemotron 3 Nano Omni
  4. Select the model card and choose Deploy
  5. Configure your instance type and deployment settings
  6. Choose Deploy to create the endpoint

Deploy using the SageMaker Python SDK

You can also deploy programmatically using the SageMaker Python SDK:

from sagemaker.jumpstart.model import JumpStartModel

# Reference the JumpStart model by its ID; role is the ARN of an IAM role
# with SageMaker permissions (left empty here as a placeholder)
model = JumpStartModel(
  model_id="huggingface-vlm-nvidia-nemotron3-nano-omni-30ba3b-reasoning-fp8",
  role="",
)

# Deployment requires accepting the model's end-user license agreement
predictor = model.deploy(
  accept_eula=True,
)
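deploy() also accepts standard SageMaker options such as instance_type and endpoint_name if you want to override the JumpStart defaults. The values below are hypothetical placeholders; check the model card in JumpStart for the recommended GPU instance:

predictor = model.deploy(
  accept_eula=True,
  instance_type="ml.g5.12xlarge",  # placeholder; use the instance type recommended on the model card
  endpoint_name="nemotron3-nano-omni",  # hypothetical endpoint name
)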

Run inference: Image understanding

Once deployed, you can send multimodal requests to the endpoint. The following example shows how to send an image understanding request:

import base64

def encode_image(image_path):
  # Read the image file and return its base64-encoded contents
  with open(image_path, "rb") as f:
    return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("example.jpg")

payload = {
  "messages": [{ 
    "role": "user", 
    "content": [ 
      {"type": "text", "text": "Describe this image in detail."},
      {"type": "image_url", 
       "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ],
  }],
  "max_tokens": 1024,
  "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
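If you are calling the endpoint from an application that doesn't use the SageMaker Python SDK, you can invoke it directly with boto3. A minimal sketch, reusing the JSON payload built above and taking the endpoint name from the predictor:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Send the same JSON payload to the endpoint created by deploy()
resp = runtime.invoke_endpoint(
  EndpointName=predictor.endpoint_name,
  ContentType="application/json",
  Body=json.dumps(payload),
)
result = json.loads(resp["Body"].read())
print(result["choices"][0]["message"]["content"])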

Run inference: Video understanding with reasoning

The following example summarizes a meeting recording, using the thinking-mode sampling parameters recommended in the table later in this post:

import base64

def encode_video(video_path):
  with open(video_path, "rb") as f:
    return base64.b64encode(f.read()).decode("utf-8")

video_b64 = encode_video("meeting_recording.mp4")

payload = { 
  "messages": [{ 
    "role": "user", 
    "content": [ 
      {"type": "video_url", 
       "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
      {"type": "text",
       "text": "Summarize the key discussion points."},
    ],
  }],
  "max_tokens": 20480,
  "temperature": 0.6,
  "top_p": 0.95,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])

Run inference: Audio transcription

The following example transcribes a call recording and extracts action items, using the instruct-mode parameters:

import base64

def encode_audio(audio_path): 
  with open(audio_path, "rb") as f: 
    return base64.b64encode(f.read()).decode("utf-8")

audio_b64 = encode_audio("customer_call.wav")

payload = { 
  "messages": [{ 
    "role": "user", 
    "content": [ 
      {"type": "audio_url", 
       "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
      {"type": "text", 
       "text": "Transcribe this audio and identify key action items."},
    ],
  }],
  "max_tokens": 1024,
  "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
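Because the model holds all modalities in one context, you can also combine them in a single request. The following sketch pairs an image with an audio clip in one message; the file names and the prompt are hypothetical placeholders:

import base64

def encode_file(path):
  # Read a local media file and return its base64-encoded contents
  with open(path, "rb") as f:
    return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical inputs for illustration
screenshot_b64 = encode_file("order_screen.jpg")
call_b64 = encode_file("customer_call.wav")

payload = {
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url",
       "image_url": {"url": f"data:image/jpeg;base64,{screenshot_b64}"}},
      {"type": "audio_url",
       "audio_url": {"url": f"data:audio/wav;base64,{call_b64}"}},
      {"type": "text",
       "text": "Does the order on screen match what the customer asked for in the call?"},
    ],
  }],
  "max_tokens": 1024,
  "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])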

Recommended inference parameters

The following table contains recommended sampling parameter values for Nemotron 3 Nano Omni. The values change depending on the inference mode.

Mode | Temperature | top_p | max_tokens | Use case
Thinking | 0.6 | 0.95 | 20480 | Complex reasoning
Instruct | 0.2 | N/A | 1024 | Standard tasks, ASR

For tasks that involve reasoning and complex understanding, we recommend enabling thinking mode. For transcription and straightforward tasks, instruct mode (with thinking disabled) provides faster responses.
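If you switch between the two modes in code, keeping the presets from the table side by side avoids scattering magic numbers through your payload-building logic; a minimal sketch:

# Sampling presets matching the recommended values in the table above
THINKING = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480}
INSTRUCT = {"temperature": 0.2, "max_tokens": 1024}

def build_payload(content_blocks, mode=INSTRUCT):
  # Assemble a chat-style payload with the sampling preset for the chosen mode
  return {"messages": [{"role": "user", "content": content_blocks}], **mode}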

Clean up

To avoid incurring unnecessary costs, delete the SageMaker endpoint when you're done:

predictor.delete_endpoint()
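If you also want to remove the underlying model resource, the SDK's delete_model() call on the predictor takes care of that as well:

predictor.delete_model()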

Conclusion

NVIDIA Nemotron 3 Nano Omni brings a new level of multimodal intelligence to Amazon SageMaker JumpStart. By combining video, audio, image, and text understanding into one efficient model, it simplifies the development of enterprise agent applications while delivering up to 9x higher throughput and leading accuracy compared to other open omni models.

Whether you're building computer use agents that navigate GUIs, document intelligence pipelines that automate workflows, or audio and video analytics systems for customer service, Nemotron 3 Nano Omni provides the perception layer your agents need in a single model call.

Get started today by deploying Nemotron 3 Nano Omni from Amazon SageMaker JumpStart. For more information about the model, visit the NVIDIA Nemotron model page on Hugging Face.


About the authors

Dan Ferguson is a Solution Architect at AWS, based in New York, USA. As a machine learning services specialist, Dan works to support clients on their journey to integrate ML workflows effectively, efficiently, and sustainably.

Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to take advantage of state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master's degree in Computer Science.

Vivek Gangasani is the Global Lead of Solutions Architecture for SageMaker Inference. He leads solutions architecture, technical go-to-market (GTM), and outbound product strategy for SageMaker Inference. He also helps enterprises and startups deploy and optimize generative AI models and build AI workflows with SageMaker and GPUs. Currently, he is focused on developing strategies and content to improve inference performance for use cases such as agentic workflows and RAG. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
