LLM Chat · fast · affordable

Mistral Small 4

by Mistral · Released March 16, 2026

Mistral AI's unified hybrid model combining instruct, reasoning (Magistral), and coding (Devstral) capabilities. 119B total parameters with 6.5B active. Features a reasoning_effort parameter, multimodal text+image input, and an industry-first unified architecture replacing three separate models.

Powered by Mistral · Hybrid MoE (119B total / 6.5B active)

Context Window

256K

Parameters

119B total / 6.5B active (MoE)

Max Output

16K

Category

LLM Chat

Overview

Mistral Small 4, released March 16, 2026, is the first Mistral model to unify general instruct capabilities with three previously separate specialized lines: reasoning (Magistral), multimodal (Pixtral), and coding (Devstral). It does so in a single 119B total parameter MoE architecture with only 6.5B active parameters per token (8B including embedding and output layers). Mistral presents this unification as an industry first; it removes the need to route between specialized models and simplifies production deployments.

The architecture features 128 experts with 4 active per token, a 256K context window, and a reasoning_effort parameter that lets developers control reasoning depth: "none" for fast direct responses and "high" for deep chain-of-thought reasoning. It supports multimodal text+image input and is released under the Apache 2.0 license, which permits commercial use, making it one of the most capable fully open models available.
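As a rough illustration of the reasoning_effort switch described above, the sketch below builds a chat request body with the parameter set. It assumes the parameter is passed as a top-level field of the request body, which this page does not confirm, and `build_chat_body` is a hypothetical helper name. Only the two values named in the text ("none" and "high") are accepted here; the real API may allow others.

```python
import json

# The two values named in the text; other values may exist but are not documented here.
ALLOWED_EFFORTS = {"none", "high"}


def build_chat_body(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build a chat-completions body (hypothetical helper, not an official client)."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return {
        "model": "mistralai/mistral-small-2603",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: reasoning_effort is a top-level field of the request body.
        "reasoning_effort": reasoning_effort,
    }


# Fast path for a simple lookup, deep reasoning for a harder problem.
fast_body = build_chat_body("What is the capital of France?", "none")
deep_body = build_chat_body("Prove that the square root of 2 is irrational.", "high")
print(json.dumps(deep_body, indent=2))
```

Choosing the effort per request, rather than per deployment, is what lets one model cover both the fast instruct path and the slower reasoning path.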

Efficiency per token is a defining characteristic. On AA LCR (Artificial Analysis's Long Context Reasoning benchmark), Mistral Small 4 scores 0.72 with only 1.6K characters of output, while Qwen models need 5.8-6.1K characters to achieve comparable performance. Mistral Small 4 therefore delivers a comparable answer in roughly one-quarter of the output length. On LiveCodeBench, it outperforms GPT-OSS-120B while using 20% less output. This efficiency directly affects cost and scalability in production, as fewer output tokens mean lower API bills and faster response times.
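The "roughly one-quarter" figure can be checked directly from the character counts quoted above. This is only arithmetic on the stated numbers; since the figures are in characters, the ratio in tokens is an approximation that depends on each model's tokenizer.

```python
# Output lengths in thousands of characters, as quoted in the text.
mistral_chars_k = 1.6
qwen_chars_k = (5.8, 6.1)

# Ratio of Mistral Small 4 output length to each end of the Qwen range.
ratios = [mistral_chars_k / q for q in qwen_chars_k]

for q, r in zip(qwen_chars_k, ratios):
    print(f"{mistral_chars_k}K / {q}K = {r:.2f}")
```

Both ends of the range land between 0.26 and 0.28, which is consistent with the one-quarter claim.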

Mistral Small 4 achieves a 40% latency reduction and a 3x throughput improvement over Mistral Small 3 while remaining competitive with GPT-OSS-120B on benchmarks. Mistral AI is a founding member of the NVIDIA Nemotron Coalition, and the model is available day-0 as an NVIDIA NIM for optimized containerized inference on NVIDIA infrastructure. It can also be customized with NVIDIA NeMo for domain-specific fine-tuning.

For self-hosting, Mistral Small 4 requires a minimum of 4x HGX H100, 2x HGX H200, or 1x DGX B200, and is supported on vLLM, llama.cpp, SGLang, and Transformers. The model is also available through La Plateforme (Mistral's API), Google Cloud Vertex AI, Amazon Bedrock, and Azure AI Foundry, as well as through CallMissed's unified gateway.

At $0.20 per million input tokens and $0.80 per million output tokens, Mistral Small 4 is among the most affordable models in its capability class. It combines unified instruct, reasoning, coding, and multimodal capabilities with high efficiency (6.5B active from 119B total), Apache 2.0 licensing, and NVIDIA NIM availability. That makes it a compelling choice for production deployments that need reasoning, coding, and general capabilities in a single model, without the complexity of routing between specialized systems.

Pricing

Metric               Price
Input /1M tokens     ₹20.0000
Output /1M tokens    ₹80.0000

1 credit = ₹1 = $0.01 USD. Prices shown from provider; CallMissed passes through with ~35% markup.
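To make the pricing concrete, here is a minimal cost estimator based on the listed rates and the stated conversion of 1 credit = ₹1 = $0.01. It assumes the listed per-million rates are what is actually billed and does not add the ~35% markup on top, since the note does not say whether the table already includes it. `estimate_cost` is a hypothetical helper name.

```python
CREDIT_USD = 0.01            # 1 credit = ₹1 = $0.01, per the pricing note
INPUT_CREDITS_PER_M = 20.0   # listed input price per 1M tokens
OUTPUT_CREDITS_PER_M = 80.0  # listed output price per 1M tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (credits, usd) for one request at the listed rates."""
    credits = (
        input_tokens * INPUT_CREDITS_PER_M + output_tokens * OUTPUT_CREDITS_PER_M
    ) / 1_000_000
    return credits, credits * CREDIT_USD


# Example: a request with 2,000 input tokens and 500 output tokens.
credits, usd = estimate_cost(2_000, 500)
print(f"{credits:.4f} credits, about ${usd:.6f}")
```

Because output tokens cost four times as much as input tokens, shorter responses reduce the bill faster than shorter prompts do.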

Key Highlights

  • Unifies instruct, reasoning, and coding in one model
  • 119B total params, only 6.5B active — ultra-efficient
  • reasoning_effort parameter for controlling reasoning depth
  • Multimodal: text and image input

Benchmarks

Benchmark     Score
MMLU-Pro      78.2%
HumanEval     86.8%
MATH-500      84.5%
AA LCR        0.72
Throughput    3x Small 3
Latency       -40%

Technical Details

  • Architecture: 119B total MoE with 128 experts, 4 active per token (6.5B active, 8B incl. embedding/output)
  • Unifies Magistral (reasoning) + Pixtral (multimodal) + Devstral (coding)
  • reasoning_effort parameter: "none" for fast, "high" for deep reasoning
  • Context window: 256K tokens
  • Apache 2.0 license — full commercial freedom
  • 40% latency reduction, 3x throughput vs Mistral Small 3
  • Competitive with GPT-OSS-120B on benchmarks with shorter outputs
  • Minimum hardware: 4x HGX H100, 2x HGX H200, or 1x DGX B200
  • Available on vLLM, llama.cpp, SGLang, Transformers
  • Available via Mistral API and CallMissed unified gateway
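The sparsity implied by the architecture figures above can be worked out with a few divisions. This sketch uses only the numbers listed in the technical details; the variable names are illustrative.

```python
total_params_b = 119.0       # total parameters, in billions
active_params_b = 6.5        # active per token, excluding embedding/output layers
active_with_embed_b = 8.0    # active per token, including embedding/output layers
experts_total = 128
experts_active = 4

active_fraction = active_params_b / total_params_b
active_with_embed_fraction = active_with_embed_b / total_params_b
expert_fraction = experts_active / experts_total

print(f"Active parameters:      {active_fraction:.1%} of total")
print(f"Incl. embedding/output: {active_with_embed_fraction:.1%} of total")
print(f"Active experts:         {expert_fraction:.1%} of experts")
```

The active-parameter share (about 5.5%) is higher than the active-expert share (about 3.1%). In MoE models this gap typically comes from layers outside the routed experts, such as attention, which run for every token.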

Strengths

  • Unifies instruct, reasoning, and coding — no need for model routing
  • Only 6.5B active params from 119B total — extremely efficient
  • Apache 2.0 license with full commercial freedom
  • reasoning_effort parameter for flexible compute-quality trade-off
  • Ultra-affordable at $0.20/$0.80 per 1M tokens

Limitations

  • Lower absolute capability than larger frontier models (GPT-5.4, Opus 4.6)
  • Having only 6.5B active parameters limits depth on the most complex reasoning tasks
  • Newer unified architecture with less production track record

Use Cases

  • Code generation
  • Reasoning tasks
  • Multimodal analysis
  • Cost-efficient deployment

API Example

curl https://api.callmissed.com/v1/chat/completions \
  -H "Authorization: Bearer cm_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/mistral-small-2603", "messages": [{"role": "user", "content": "Write a Rust function with error handling"}]}'

Endpoint: POST /v1/chat/completions · Model ID: mistralai/mistral-small-2603
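For callers who prefer Python over curl, the same request can be sketched with the standard library alone. The URL, model ID, and bearer-token header come from the example above; `build_request` and `send` are hypothetical helper names, and the shape of the JSON response is not documented on this page, so the sketch returns it unparsed beyond JSON decoding.

```python
import json
import urllib.request

API_URL = "https://api.callmissed.com/v1/chat/completions"
MODEL_ID = "mistralai/mistral-small-2603"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


# Usage (requires a valid key and network access):
#   req = build_request("cm_YOUR_KEY", "Write a Rust function with error handling")
#   result = send(req)
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.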
