DailyGlimpse

Nvidia’s Nemotron 3 Nano Omni: How Open Data and Rival Models Power the New Multimodal AI

AI
April 29, 2026 · 1:37 PM

Nvidia has unveiled Nemotron 3 Nano Omni, an open-source multimodal model capable of processing text, images, video, and audio within a single architecture. The 30-billion-parameter model uses a hybrid Mamba-Transformer design with Mixture-of-Experts layers, activating roughly three billion parameters per query. It pairs Nvidia's C-RADIOv4-H vision encoder with a Parakeet-TDT audio encoder and supports a context window of up to 256,000 tokens. English is the only officially supported language.
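For readers who want to experiment, a minimal loading sketch in Python follows. The repository id and the processor usage are assumptions for illustration, not confirmed details from the release; the official model card will have the definitive names.

```python
# Minimal sketch: loading a multimodal checkpoint via Hugging Face Transformers.
# The repository id below is hypothetical; check Nvidia's model card for the
# actual name and the recommended usage pattern.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "nvidia/Nemotron-3-Nano-Omni"  # hypothetical repository id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # BF16 is one of the released weight formats
    device_map="auto",
    trust_remote_code=True,
)

# Only ~3B of the 30B parameters are active per token thanks to MoE routing,
# but the full parameter set still has to fit in memory.
inputs = processor(text="Summarize this document.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```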

According to the technical report, Nemotron 3 Nano Omni is designed primarily for agentic applications, including document processing, computer-use agents, video and audio analysis, and voice interaction. On benchmarks such as OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, the model surpasses its predecessor, Nemotron Nano V2 VL, and competes with Alibaba's Qwen3-Omni. On the OSWorld GUI-agent benchmark, accuracy jumped from the previous version's 11.1 points to 47.4. Nvidia claims throughput at the same level of interactivity is up to nine times that of Qwen3-Omni.

How Rival Models Shaped the Training Data

The benchmarks are noteworthy, but the training data details—rarely disclosed in open-source releases—are particularly revealing. Nvidia processed roughly 717 billion tokens across seven training stages, progressively expanding the context window.
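The report does not break the 717 billion tokens down per stage, but the shape of such a curriculum, growing the maximum sequence length stage by stage, is easy to sketch. In the illustration below, everything except the stage count, the total token budget, and the final ~256K context length is hypothetical:

```python
# Schematic only: a staged training curriculum that grows the context window.
# Stage names, per-stage token budgets, sequence lengths, and the batch size
# are invented; the report states seven stages and ~717B tokens in total.
STAGES = [
    # (stage, tokens, max_seq_len)
    ("stage_1_alignment",      100e9,   4_096),
    ("stage_2_pretrain",       250e9,   8_192),
    ("stage_3_multimodal",     150e9,  16_384),
    ("stage_4_interleaved",    100e9,  32_768),
    ("stage_5_long_context",    60e9,  65_536),
    ("stage_6_longer_context",  40e9, 131_072),
    ("stage_7_max_context",     17e9, 262_144),  # ~256K tokens
]

assert abs(sum(t for _, t, _ in STAGES) - 717e9) < 1e9  # matches the reported total

for name, tokens, seq_len in STAGES:
    steps = int(tokens // (seq_len * 512))  # assuming a batch of 512 sequences
    print(f"{name}: {tokens / 1e9:.0f}B tokens at seq_len={seq_len} (~{steps:,} steps)")
```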

A substantial portion of synthetic training data was generated using competing models. Image captions, question-answer pairs, and reasoning traces were produced with Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen2.5-VL-72B-Instruct, OpenAI's gpt-oss-120b, Kimi-K2.5, GLM-4.1V-9B-Thinking, and DeepSeek-OCR. Nvidia also used GPT-4o and Gemini 3 Flash Preview for filtering.
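The article does not describe the generation code, but the basic pattern of distilling captions and question-answer pairs from a teacher model is straightforward. A minimal sketch against an OpenAI-compatible endpoint, with the endpoint, prompts, and serving setup as placeholders (only the teacher model name comes from the report):

```python
# Sketch of teacher-driven synthetic data generation (caption + QA pairs).
# Endpoint and prompt are placeholders; the report names the teacher models
# but not how they were prompted or served.
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server
TEACHER = "Qwen3-VL-30B-A3B-Instruct"  # one of the teachers named in the report

def generate_caption_and_qa(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=TEACHER,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Write a dense caption, then three question-answer pairs "
                         "about this image. Return JSON with keys 'caption' and 'qa_pairs'."},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

Per the report, outputs like these were then filtered, with GPT-4o and Gemini 3 Flash Preview among the filtering models, before entering the training mix.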

Using other models to train new ones is common industry practice, though most developers are not as transparent about it. Companies like OpenAI, Anthropic, and Google have repeatedly accused Chinese AI labs of large-scale distillation efforts.

The audio training data includes Nvidia's Granary and SIFT-50M datasets, along with captions from Qwen's Omni-Captioner. For reinforcement learning, the team built a five-stage pipeline spanning 25 environments, covering tasks such as visual grounding, chart and document understanding, GUI clicks, and automatic speech recognition.
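The environment interfaces themselves are not detailed in the article, so the following is only a schematic of what one of the 25 environments, a GUI-click grounding task, might look like in a Gymnasium-style API; all names are hypothetical and the actual NeMo-RL interfaces may differ:

```python
# Schematic GUI-click grounding environment in the Gymnasium style.
# Names and reward scheme are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class ClickTask:
    screenshot: bytes   # rendered UI image shown to the model
    instruction: str    # e.g. "Click the 'Save' button"
    target_box: tuple   # (x0, y0, x1, y1) ground-truth region

class GuiClickEnv:
    """Single-step environment: the policy sees a screenshot and an
    instruction, emits an (x, y) click, and is rewarded if the click
    lands inside the target bounding box."""

    def __init__(self, tasks: list[ClickTask]):
        self.tasks = tasks
        self.idx = 0

    def reset(self) -> ClickTask:
        task = self.tasks[self.idx % len(self.tasks)]
        self.idx += 1
        return task

    def step(self, task: ClickTask, click: tuple[float, float]):
        x0, y0, x1, y1 = task.target_box
        x, y = click
        reward = 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0
        done = True  # one decision per episode
        return reward, done
```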

Along with model weights in BF16, FP8, and NVFP4, Nvidia is releasing parts of the training data, the training pipelines on Megatron-Bridge, and the RL recipes on NeMo-RL. This distinguishes the release from projects that ship weights only. The model is available under the NVIDIA Open Model Agreement, which permits commercial use. One practical note: reasoning mode is enabled by default, so for tasks that don't require chain-of-thought, users must disable it manually (a sketch of what that might look like follows).
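How the toggle works depends on the chat template. As a hedged illustration only: several recent open models, including earlier Nemotron releases, control reasoning through a system-prompt directive, and the sketch below assumes a "/no_think"-style switch; both the served-model name and the directive are unverified assumptions, so check the model card for the real mechanism.

```python
# Sketch: disabling the default reasoning mode for a task that doesn't
# need chain-of-thought. The "/no_think" directive and model name are
# assumptions modeled on other recent open releases, not confirmed here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nemotron-3-nano-omni",                  # hypothetical served-model name
    messages=[
        {"role": "system", "content": "/no_think"},  # assumed reasoning-off switch
        {"role": "user", "content": "Transcribe the attached audio clip."},
    ],
)
print(resp.choices[0].message.content)
```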