How Llama 4's Mixture-of-Experts Architecture Works
Meta's Llama 4 family is the first Llama generation to ship as a Mixture-of-Experts (MoE) architecture. That single design choice explains most of what's different about Scout and Maverick — including why both have "17 billion active parameters" but very different total parameter counts, and why the smaller-looking Scout still needs serious GPU memory to run.
MoE in one paragraph
In a dense transformer, every parameter participates in every token's forward pass: a 70B dense model touches all ~70B weights per token, which works out to roughly 2 FLOPs per parameter, or about 140 GFLOPs per token. In a Mixture-of-Experts model, the feed-forward layers are split into many parallel "experts," and a small router picks just a few experts per token. The model's total parameter count is large (the sum of all experts), but the active parameter count per token is small (only the chosen experts plus the shared layers).
The result: MoE models can be much larger in total capacity than dense models, while running at roughly the latency and FLOPs of a much smaller dense model.
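The compute claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token and ignores attention-score FLOPs. The function name is illustrative, not from any library.

```python
# Rule-of-thumb assumption: ~2 FLOPs per active parameter per token
# (one multiply and one add per weight). Attention-score cost is ignored.
FLOPS_PER_ACTIVE_PARAM = 2


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs for a single token."""
    return FLOPS_PER_ACTIVE_PARAM * active_params


dense_70b = forward_flops_per_token(70e9)       # dense: all 70B params are active
moe_17b_active = forward_flops_per_token(17e9)  # MoE: only 17B params are active

print(f"dense 70B      : {dense_70b / 1e9:.0f} GFLOPs per token")
print(f"MoE, 17B active: {moe_17b_active / 1e9:.0f} GFLOPs per token")
print(f"ratio          : {dense_70b / moe_17b_active:.1f}x")
```

Under this approximation, per-token compute depends only on the active parameter count, which is why a model with hundreds of billions of total parameters can cost less per token than a 70B dense model.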
Scout vs Maverick: same active, different total
Per Meta's release and Hugging Face documentation:
Both models pick a small number of experts per token, so per-token compute is similar. The differences are in total knowledge capacity and the routing dynamics.
Why same active params can give different quality
Maverick's 128 experts allow finer specialization — each expert can carve out a narrower slice of the input distribution. Scout's 16 experts are broader. On most reasoning and instruction-following benchmarks, Maverick leads Scout. The tradeoff is that Maverick's 400B total parameters need to live somewhere — typically multiple GPUs.
The memory tradeoff teams miss
This is the most frequently misunderstood part of MoE deployment.
"MoE saves FLOPs (compute), not memory footprint."
Even if only 17B parameters compute on each token, all the experts need to be in GPU memory — otherwise routing requests to off-loaded experts costs you a CPU↔GPU transfer per token, which destroys throughput. So:
The deployment decision rarely hinges on active parameters. The real question is: "Do I have the GPU memory to hold all experts hot?"
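To make the memory budget concrete, here is a rough sketch that estimates weight memory from total parameters rather than active ones. The 400B figure for Maverick comes from the text above. The 109B total for Scout is Meta's published figure and does not appear elsewhere in this article. The 80 GB GPU size, the 90% usable fraction, and the function names are illustrative assumptions. KV cache and activations are not counted.

```python
import math

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params: float, dtype: str = "bf16") -> float:
    """Memory needed to keep ALL weights resident, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(total_params: float, dtype: str = "bf16",
                         gpu_memory_gb: float = 80.0,
                         usable_fraction: float = 0.9) -> int:
    """Lower bound on GPU count for weights alone (no KV cache, no activations)."""
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(total_params, dtype) / usable_gb)


models = {
    "Scout (109B total, Meta's published figure)": 109e9,
    "Maverick (400B total)": 400e9,
}
for name, total in models.items():
    for dtype in ("bf16", "int4"):
        print(f"{name:45s} {dtype}: {weight_memory_gb(total, dtype):6.1f} GB, "
              f">= {min_gpus_for_weights(total, dtype)} x 80 GB GPUs")

# For contrast: sizing by the 17B active parameters alone would suggest ~34 GB.
print(f"17B active only, bf16: {weight_memory_gb(17e9):.1f} GB")
```

The gap between the last line and the per-model figures is the mistake the section describes: budgeting by active parameters underestimates the weight footprint by a wide margin.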
Routing: what the gating function actually does
Each MoE layer has a small router network (often a single linear layer plus softmax). For every token, the router scores all experts and picks the top-k. Many earlier MoE models use top-2; per Meta's release, Llama 4 sends each token to a shared expert plus one routed expert. The selected experts run their feed-forward computation; the others stay idle for that token.
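The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Llama 4's implementation: the router is a single linear layer, the experts are stand-in functions, gate weights are renormalized over the selected experts (one common convention), and there is no shared expert or load balancing. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, k):
    """Score every expert for one token; return [(expert_index, gate)] for the top k."""
    # One router logit per expert: dot product of that expert's row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize gates over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, gate in route_token(token, router_weights, k):
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 3 experts, 2-dimensional tokens; expert i scales its input by i + 1.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
token = [2.0, 1.0]

print(route_token(token, router_weights, k=2))
print(moe_layer(token, router_weights, experts, k=1))
```

Setting `k=1` mimics single-expert routing, while `k=2` mimics the top-2 scheme. In both cases the compute per token scales with `k`, not with the number of experts.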
Two important properties:
Multimodality and 10M context
Llama 4 is also Meta's first natively multimodal Llama line: Scout and Maverick both ingest text and images out of the box. Scout's headline is its 10M-token context window, among the longest of any open-weight or proprietary model at release. Maverick supports 1M tokens.
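Long context windows carry their own memory cost through the KV cache. The sketch below uses the standard formula for a generic transformer that keeps full-attention keys and values for every layer. The layer count, KV-head count, and head dimension are hypothetical placeholders, not confirmed Llama 4 values, and any local or chunked attention layers would reduce the real figure, so treat the result as an upper-bound illustration.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence, assuming full attention in every layer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

for context in (8_192, 131_072, 1_000_000, 10_000_000):
    size_gb = kv_cache_bytes(context, **config) / 1e9
    print(f"{context:>10,} tokens -> {size_gb:10.1f} GB of KV cache per sequence")
```

Because the cache grows linearly with sequence length, a context window in the millions of tokens can require more memory per sequence than the model weights themselves under these assumptions.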
Two practical caveats:
Training details
Per Meta's release: Scout was pretrained on roughly 40 trillion tokens of multimodal data; Maverick on roughly 22 trillion tokens. Both were trained on a mix of publicly available, licensed, and Meta-product data. The bigger surprise was that Scout — the smaller of the two — got more training tokens, which is consistent with the modern open-research pattern of training smaller models longer for better per-parameter quality.
When MoE is right (and when it isn't)
Pick MoE when:
Pick dense when:
The 2026 industry trend is unambiguous: above ~100B parameters, almost every flagship model is now MoE. DeepSeek V4, Qwen 3.5, Mistral Large 3, and Llama 4 all sit on the MoE side. Mistral Medium 3.5 is the notable counterexample at 128B dense, deliberately positioning against the MoE pattern.
How to think about Llama 4 specifically
If you're picking between Scout and Maverick:
Either way, you're now operating an MoE, and its lessons (memory budgeting, expert-parallel sharding, KV-cache scaling) are the new operational baseline for serving open-weight models at scale.