Mistral Small 4
by Mistral · Released March 16, 2026
Mistral AI's unified hybrid model combining instruct, reasoning (Magistral), and coding (Devstral) capabilities. 119B total parameters with 6.5B active. Features a reasoning_effort parameter, multimodal text+image input, and an industry-first unified architecture replacing three separate models.
Powered by Mistral · Hybrid MoE (119B total / 6.5B active)
| Spec | Value |
|---|---|
| Context Window | 128K |
| Parameters | 119B total / 6.5B active (MoE) |
| Max Output | 16K |
| Category | LLM Chat |
Overview
Mistral Small 4, released March 16, 2026, is the first Mistral model to fold three previously separate model lines, Magistral (reasoning), Pixtral (multimodal), and Devstral (coding), into its instruct model. The result is a single MoE architecture with 119B total parameters and only 6.5B active per token (8B including embedding and output layers). This unification is billed as an industry first: it removes the need to route between specialized models and simplifies production deployments.
The architecture features 128 experts with 4 active per token, a 256K context window, and an innovative reasoning_effort parameter that lets developers control reasoning depth: "none" for fast direct responses and "high" for deep chain-of-thought reasoning. It supports multimodal text+image input and is released under the Apache 2.0 license, making it one of the most capable fully open models available for commercial use without restrictions.
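As a minimal sketch of how reasoning_effort might be set per request, the helper below builds a chat-completions request body. It assumes the parameter is passed as a top-level field of an OpenAI-compatible payload, which this page does not confirm. The function name build_payload is hypothetical, the model ID is the one used in this page's API example, and only the two values named above are accepted.

```python
import json

# Only the two values named in the text; other levels are not assumed here.
ALLOWED_EFFORTS = {"none", "high"}


def build_payload(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build a chat-completions request body (hypothetical helper).

    Assumption: reasoning_effort is a top-level field of the request body.
    """
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return {
        "model": "mistralai/mistral-small-2603",
        "reasoning_effort": reasoning_effort,
        "messages": [{"role": "user", "content": prompt}],
    }


if __name__ == "__main__":
    # Fast direct answer for a simple task.
    print(json.dumps(build_payload("Summarize this changelog."), indent=2))
    # Deep reasoning for a harder task.
    print(json.dumps(build_payload("Prove that sqrt(2) is irrational.", "high"), indent=2))
```

Choosing the effort level per request, rather than per deployment, is what lets one model cover both latency-sensitive and reasoning-heavy traffic.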
Output efficiency is a defining characteristic. On AA LCR (Artificial Analysis's Long Context Reasoning benchmark), Mistral Small 4 scores 0.72 with only 1.6K characters of output, while Qwen models need 5.8-6.1K characters for comparable performance. In other words, Mistral Small 4 reaches a similar score with roughly one-quarter of the output length. On LiveCodeBench, it outperforms GPT-OSS-120B while producing 20% less output. Shorter outputs directly affect cost and scalability in production: fewer output tokens mean lower API bills and faster responses.
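The "roughly one-quarter" figure follows directly from the character counts quoted above. The short calculation below uses only those numbers; the helper name length_ratio is illustrative. Character counts are only a proxy for token counts, so the ratio is approximate.

```python
# Output lengths in characters, taken from the figures quoted above.
MISTRAL_SMALL_4_CHARS = 1_600
QWEN_CHARS_RANGE = (5_800, 6_100)


def length_ratio(model_chars: int, baseline_chars: int) -> float:
    """Return the model's output length as a fraction of the baseline's."""
    return model_chars / baseline_chars


if __name__ == "__main__":
    for baseline in QWEN_CHARS_RANGE:
        ratio = length_ratio(MISTRAL_SMALL_4_CHARS, baseline)
        print(f"{MISTRAL_SMALL_4_CHARS} / {baseline} = {ratio:.1%}")
```

Both ratios land between 26% and 28%, which is consistent with the one-quarter claim.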
Mistral Small 4 delivers a 40% latency reduction and a 3x throughput improvement over Mistral Small 3 while remaining competitive with GPT-OSS-120B on benchmarks. Mistral AI is a founding member of the NVIDIA Nemotron Coalition, and the model is available from day 0 as an NVIDIA NIM for optimized containerized inference on NVIDIA infrastructure. It can also be customized with NVIDIA NeMo for domain-specific fine-tuning.
For self-hosting, Mistral Small 4 requires a minimum of 4x HGX H100, 2x HGX H200, or 1x DGX B200, and is supported on vLLM, llama.cpp, SGLang, and Transformers. The model is also available through La Plateforme (Mistral's API), Google Cloud Vertex AI, Amazon Bedrock, and Azure AI Foundry, as well as through CallMissed's unified gateway.
At $0.20 per million input tokens and $0.80 per million output tokens, Mistral Small 4 is among the most affordable models in its class. Unified instruct, reasoning, coding, and multimodal capabilities, sparse activation (6.5B active of 119B total parameters), Apache 2.0 licensing, and NVIDIA NIM availability make it a compelling choice for production deployments that need all of these in a single model without routing between specialized systems.
Pricing
| Metric | Price |
|---|---|
| Input /1M tokens | ₹20.0000 |
| Output /1M tokens | ₹80.0000 |
1 credit = ₹1 = $0.01 USD. Prices shown are the provider's; CallMissed passes them through with a markup of about 35%.
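To make the pricing concrete, here is a rough cost estimator based on the table and note above. It is a sketch under stated assumptions: the markup is treated as exactly 35% and as applied multiplicatively on top of the listed prices, neither of which this page spells out. The function name estimate_cost is hypothetical.

```python
INPUT_CREDITS_PER_M = 20.0   # credits per 1M input tokens (pricing table)
OUTPUT_CREDITS_PER_M = 80.0  # credits per 1M output tokens (pricing table)
USD_PER_CREDIT = 0.01        # 1 credit = $0.01 USD (pricing note)
ASSUMED_MARKUP = 0.35        # "~35%" treated as exactly 35% (assumption)


def estimate_cost(input_tokens: int, output_tokens: int, with_markup: bool = False) -> dict:
    """Estimate the cost of one request in credits and USD."""
    credits = (
        input_tokens / 1_000_000 * INPUT_CREDITS_PER_M
        + output_tokens / 1_000_000 * OUTPUT_CREDITS_PER_M
    )
    if with_markup:
        # Assumption: the markup multiplies the listed provider price.
        credits *= 1 + ASSUMED_MARKUP
    return {"credits": credits, "usd": credits * USD_PER_CREDIT}


if __name__ == "__main__":
    base = estimate_cost(2_000, 500)
    marked_up = estimate_cost(2_000, 500, with_markup=True)
    print(f"provider price: {base['credits']:.4f} credits (${base['usd']:.6f})")
    print(f"with markup:    {marked_up['credits']:.4f} credits (${marked_up['usd']:.6f})")
```

A request with 2,000 input tokens and 500 output tokens comes to 0.08 credits at the listed prices, split evenly between input and output because output tokens cost four times as much.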
Key Highlights
- Unifies instruct, reasoning, and coding in one model
- 119B total params, only 6.5B active — ultra-efficient
- reasoning_effort parameter for per-request control of reasoning depth
- Multimodal: text and image input
Benchmarks
| Benchmark | Score |
|---|---|
| MMLU-Pro | 78.2% |
| HumanEval | 86.8% |
| MATH-500 | 84.5% |
| AA LCR | 0.72 |
| Throughput | 3x Small 3 |
| Latency | -40% |
Technical Details
- Architecture: 119B total MoE with 128 experts, 4 active per token (6.5B active, 8B incl. embedding/output)
- Unifies Magistral (reasoning) + Pixtral (multimodal) + Devstral (coding)
- reasoning_effort parameter: "none" for fast, "high" for deep reasoning
- Context window: 256K tokens
- Apache 2.0 license — full commercial freedom
- 40% latency reduction, 3x throughput vs Mistral Small 3
- Competitive with GPT-OSS-120B on benchmarks with shorter outputs
- Minimum hardware: 4x HGX H100, 2x HGX H200, or 1x DGX B200
- Available on vLLM, llama.cpp, SGLang, Transformers
- Available via Mistral API and CallMissed unified gateway
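To illustrate what "128 experts, 4 active per token" means in practice, the toy router below implements generic top-k gating. It is not Mistral's actual router: the random gate scores, the softmax over only the selected experts, and the function name route are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # from the architecture line above
TOP_K = 4          # experts active per token, from the architecture line above


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in gate scores for a single token; a real router computes these
    # from the token's hidden state.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert_id, weight in route(logits):
        print(f"expert {expert_id:3d} weight {weight:.3f}")
    print(f"experts used per token: {TOP_K / NUM_EXPERTS:.2%}")
    print(f"active parameter share: {6.5 / 119:.2%}")
```

Only the selected experts run for a given token, so about 3% of the experts are used per token. The active parameter share is higher, about 5.5%, presumably because non-expert components such as attention layers run for every token.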
Strengths
- Unifies instruct, reasoning, and coding — no need for model routing
- Only 6.5B active params from 119B total — extremely efficient
- Apache 2.0 license with full commercial freedom
- reasoning_effort parameter for flexible compute-quality trade-off
- Ultra-affordable at $0.20/$0.80 per 1M tokens
Limitations
- Lower absolute capability than larger frontier models (GPT-5.4, Opus 4.6)
- 6.5B active parameters limit depth on the most complex reasoning tasks
- Newer unified architecture with less production track record
Use Cases
API Example
curl https://api.callmissed.com/v1/chat/completions \
-H "Authorization: Bearer cm_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "mistralai/mistral-small-2603", "messages": [{"role": "user", "content": "Write a Rust function with error handling"}]}'
Endpoint: POST /v1/chat/completions · Model ID: mistralai/mistral-small-2603
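The same request can be sketched in Python using only the standard library. The endpoint URL and model ID are taken from the curl example. The environment variable name CALLMISSED_API_KEY, the helper names, and the OpenAI-style response shape read at the end are assumptions.

```python
import json
import os
import urllib.request

API_URL = "https://api.callmissed.com/v1/chat/completions"  # from the curl example
MODEL_ID = "mistralai/mistral-small-2603"                   # from the curl example


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    body = json.dumps({
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


if __name__ == "__main__":
    key = os.environ.get("CALLMISSED_API_KEY")  # hypothetical variable name
    if key:
        reply = send(build_request("Write a Rust function with error handling", key))
        # Assumes an OpenAI-style response body.
        print(reply["choices"][0]["message"]["content"])
    else:
        print("Set CALLMISSED_API_KEY to send the request.")
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without network access.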