Nemotron 3 Super
by NVIDIA · Released March 11, 2026
NVIDIA's hybrid Mamba-Transformer MoE model. 120B total parameters with only 12B active per token, delivering 2.2x higher throughput than GPT-OSS-120B. Features a 1M token context window and is optimized for agentic reasoning, software development, and cybersecurity.
| Spec | Value |
|---|---|
| Context Window | 128K |
| Parameters | 120B total / 12B active (Hybrid MoE) |
| Max Output | 16K |
| Category | LLM Chat |
Overview
Nemotron 3 Super, released March 11, 2026, is NVIDIA's flagship language model. It targets two problems in production AI: the "thinking tax" (running a massive reasoning model on every sub-task when most sub-tasks are simple) and "context explosion" (multi-agent systems generating 15x the tokens of standard chats, which inflates cost and latency). With 120B total parameters but only 12B active per token, it achieves 2.2x higher throughput than GPT-OSS-120B and 5x the throughput of the previous Nemotron Super.
The architecture introduces several innovations that work together for efficiency. Latent MoE routing compresses tokens before they reach the expert networks, allowing the system to call 4x as many specialist experts for the same inference cost. Multi-token prediction predicts multiple future tokens in a single forward pass, enabling built-in speculative decoding without requiring a separate draft model. The hybrid backbone interleaves Mamba-2 layers (which handle the majority of sequence processing with linear-time complexity) with Transformer attention layers at key depths for precise associative recall — combining the efficiency of state-space models with the precision of attention.
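The way multi-token prediction can stand in for a separate draft model can be sketched with a toy greedy speculative decoder. This is an illustration only, not NVIDIA's implementation: `toy_next_token` and `toy_draft` are hypothetical stand-ins for the main model and its extra prediction heads, and a real system would verify all drafted positions in a single batched forward pass rather than one call per position.

```python
from typing import Callable, List, Tuple

Token = int


def greedy_decode(next_token: Callable[[List[Token]], Token],
                  prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one model pass per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(next_token(out))
    return out[len(prompt):]


def speculative_decode(next_token: Callable[[List[Token]], Token],
                       draft: Callable[[List[Token], int], List[Token]],
                       prompt: List[Token], max_new_tokens: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (generated tokens, verify passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft(out, k)
        passes += 1  # one verify pass scores every drafted position
        accepted: List[Token] = []
        for tok in proposed:
            target = next_token(out + accepted)  # simulated per-position check
            accepted.append(target)              # always keep the verified token
            if tok != target:
                break                            # discard the rest of the draft
        else:
            # Every draft matched: the verify pass also yields one extra token.
            accepted.append(next_token(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new_tokens], passes


def toy_next_token(ctx: List[Token]) -> Token:
    """Hypothetical 'main model': a deterministic function of the last token."""
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical draft heads: usually right, deliberately wrong on multiples of 5."""
    seq, proposed = list(ctx), []
    for _ in range(k):
        tok = toy_next_token(seq)
        if tok % 5 == 0:
            tok = (tok + 1) % 17  # injected draft error
        proposed.append(tok)
        seq.append(tok)
    return proposed
```

Because every accepted token is checked against the main model, the output matches plain greedy decoding; the saving comes from needing fewer verify passes than generated tokens whenever drafts are accepted.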
Native NVFP4 pretraining is a key differentiator. By training natively in 4-bit floating point on NVIDIA Blackwell GPUs, Nemotron 3 Super achieves a 4x speedup on B200 versus FP8 on H100 while maintaining accuracy. This is not post-training quantization — the model was pretrained in NVFP4 from the start, avoiding the quality degradation that typically accompanies aggressive quantization.
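To give a feel for what 4-bit floating point means numerically, here is a minimal sketch of block-scaled rounding to a 4-bit grid. It assumes NVFP4 elements use the E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a block size of 16; neither detail is stated in this page, the per-block scale is kept as an ordinary Python float for simplicity, and the function names are illustrative.

```python
from typing import List

# Magnitudes representable by an E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block: List[float]) -> List[float]:
    """Scale the block so its largest magnitude maps to 6.0, snap to the grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # assumption: one shared scale per block
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def quantize(values: List[float], block_size: int = 16) -> List[float]:
    """Quantize a flat list block by block (block_size=16 is an assumption)."""
    out: List[float] = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out
```

Each element can take only 15 distinct values relative to its block scale, which is why training directly in this format, rather than rounding a finished model into it, matters for accuracy.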
The model was RL post-trained using NeMo Gym and NeMo RL across 21 diverse environment configurations with over 1.2 million rollouts, giving it agentic reasoning capabilities grounded in real task completion rather than instruction following alone. On PinchBench, a benchmark for OpenClaw agents, it scores 85.6%, the best result among open models. It also leads open models on AIME 2025, SWE-bench, and Terminal-Bench.
The hybrid Mamba-Transformer backbone makes the 1M token context window practical. Mamba-2 layers handle long-range dependencies with linear-time complexity (O(n) rather than O(n²)), while Transformer attention layers interleaved at key depths provide the precise associative recall needed for tasks like needle-in-a-haystack retrieval and exact quotation. This hybrid approach pairs the recall precision of attention with the throughput characteristics of a state-space model.
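The difference between O(n) and O(n²) is easiest to see as a growth factor. The sketch below compares how per-layer sequence-mixing cost grows when the context goes from 128K to 1M tokens, taking 128K as 131,072 and 1M as 1,048,576 tokens. It ignores constant factors and the attention layers that remain in the hybrid stack, so it is a rough scaling illustration rather than a throughput estimate.

```python
def cost_growth(n_from: int, n_to: int, exponent: int) -> float:
    """How many times more work a layer does when context grows from n_from to n_to.

    exponent=1 models a linear-time layer (state-space scan),
    exponent=2 models full self-attention.
    """
    return (n_to / n_from) ** exponent


linear_growth = cost_growth(131_072, 1_048_576, exponent=1)     # 8.0
quadratic_growth = cost_growth(131_072, 1_048_576, exponent=2)  # 64.0
```

An 8x longer context costs a linear-time layer 8x more work but a full-attention layer 64x more, which is why keeping most layers linear-time matters at this scale.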
Nemotron 3 Super is particularly strong for agentic reasoning, software development, cybersecurity triaging, and multi-agent systems where throughput and efficiency matter. Its combination of hybrid architecture, native Blackwell optimization, massive RL training across 21 environments, and 5x throughput improvement over its predecessor makes it a unique offering in the open model ecosystem — delivering frontier-class agentic performance at a fraction of the inference cost of comparable dense models.
Pricing
| Metric | Price |
|---|---|
| Input /1M tokens | ₹150.0000 |
| Output /1M tokens | ₹600.0000 |
1 credit = ₹1 = $0.01 USD. Prices shown from provider; CallMissed passes through with ~35% markup.
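As a worked example of the table above, the following sketch estimates the cost of a single request from its token counts. The function name and the sample token counts are illustrative, and the ~35% markup is deliberately not modeled because the page does not say whether the listed prices already include it.

```python
INPUT_PRICE_INR_PER_M = 150   # ₹ per 1M input tokens, from the pricing table
OUTPUT_PRICE_INR_PER_M = 600  # ₹ per 1M output tokens, from the pricing table


def request_cost_inr(input_tokens: int, output_tokens: int) -> float:
    """Cost in rupees (equivalently credits, since 1 credit = ₹1) for one request."""
    total = (input_tokens * INPUT_PRICE_INR_PER_M
             + output_tokens * OUTPUT_PRICE_INR_PER_M)
    return total / 1_000_000


# Hypothetical request: 20,000 prompt tokens and 2,000 generated tokens.
example_cost = request_cost_inr(20_000, 2_000)  # ₹3.00 input + ₹1.20 output
```

Output tokens cost four times as much as input tokens, so long generations dominate the bill faster than long prompts do.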
Key Highlights
- Hybrid Mamba-Transformer architecture
- 2.2x higher throughput than GPT-OSS-120B
- 120B total params, only 12B active per token
- Leads open models on AIME 2025, SWE-bench, and Terminal-Bench
Benchmarks
| Benchmark | Score |
|---|---|
| PinchBench | 85.6% |
| AIME 2025 | Leading among open models |
| SWE-bench | Leading among open models |
| Terminal-Bench | Leading among open models |
| Throughput vs GPT-OSS-120B | 2.2x |
| Throughput vs previous Nemotron Super | 5x |
Technical Details
- Hybrid Mamba-Transformer MoE: 120B total / 12B active per token
- Latent MoE: calls 4x as many expert specialists for the same inference cost
- Multi-token prediction: predicts multiple future tokens in one forward pass
- Hybrid backbone: Mamba for sequence efficiency, Transformer for precision
- Native NVFP4 pretraining for Blackwell: 4x speedup on B200 vs FP8 on H100
- RL post-trained across 21 environments with 1.2M rollouts using NeMo Gym
- 5x throughput over previous Nemotron Super, 2.2x over GPT-OSS-120B
- 1M token context window (practical via Mamba layers)
- Available via NVIDIA API and CallMissed unified gateway
Strengths
- 2.2x throughput over GPT-OSS-120B — exceptional efficiency
- Hybrid Mamba-Transformer architecture combines best of both worlds
- Only 12B active params per token despite 120B total — very cost-efficient
- Native NVFP4 for Blackwell GPUs — optimized for latest NVIDIA hardware
- RL post-trained across 21 environments for strong agentic capabilities
Limitations
- Hybrid architecture is newer and less battle-tested than pure Transformers
- Optimized primarily for NVIDIA hardware — less portable to other accelerators
- Higher pricing than GPT-OSS-120B despite the 2.2x throughput advantage
Use Cases
- Agentic reasoning
- Software development
- Cybersecurity triaging
- Multi-agent systems where throughput and efficiency matter
API Example
curl https://api.callmissed.com/v1/chat/completions \
  -H "Authorization: Bearer cm_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "nemotron-3-super", "messages": [{"role": "user", "content": "Debug this Kubernetes deployment configuration"}]}'
Endpoint: POST /v1/chat/completions · Model ID: nemotron-3-super
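The same request can be sketched in Python using only the standard library. The URL, model ID, and bearer-token header come from the curl example; the `Content-Type` header and the OpenAI-style response shape (`choices[0].message.content`) are assumptions, and `build_request` and `chat` are illustrative names rather than part of any official client.

```python
import json
import urllib.request

API_URL = "https://api.callmissed.com/v1/chat/completions"
MODEL_ID = "nemotron-3-super"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: JSON body expected
        },
        method="POST",
    )


def chat(api_key: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the reply text."""
    with urllib.request.urlopen(build_request(api_key, prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: the gateway returns an OpenAI-compatible response body.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("cm_YOUR_KEY", "Debug this Kubernetes deployment configuration")` would mirror the curl command; separating `build_request` from `chat` lets the payload be inspected or tested without network access.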