How Llama 4's Mixture-of-Experts Architecture Works

CallMissed

Meta's Llama 4 family is the first Llama generation to ship as a Mixture-of-Experts (MoE) architecture. That single design choice explains most of what's different about Scout and Maverick — including why both have "17 billion active parameters" but very different total parameter counts, and why the smaller-looking Scout still needs serious GPU memory to run.

MoE in one paragraph

In a dense transformer, every parameter participates in every token: a 70B dense model uses all ~70B parameters per token, which is roughly 140 GFLOPs for a forward pass at about 2 FLOPs per parameter. In a Mixture-of-Experts model, the feed-forward layers are split into many parallel "experts," and a small router picks just a few experts per token. The model's total parameter count is large (the sum of all experts plus the shared layers), but the active parameter count per token is small (only the chosen experts plus the shared layers).

The result: MoE models can be much larger in total capacity than dense models, while running at roughly the latency and FLOPs of a much smaller dense model.
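As a rough back-of-the-envelope check, a common rule of thumb puts forward-pass cost at about 2 FLOPs per active parameter per token, ignoring the cost of attention over the context. The sketch below applies that approximation; the function name is hypothetical and the 70B and 17B figures are illustrative inputs, not measurements.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token.

    Rule of thumb: ~2 FLOPs (one multiply, one add) per active parameter.
    Ignores attention-over-context cost, so treat it as a lower bound.
    """
    return 2.0 * active_params


dense_70b = forward_flops_per_token(70e9)  # dense: every parameter is active
moe_17b = forward_flops_per_token(17e9)    # MoE: only routed + shared parameters are active

print(f"dense 70B      : {dense_70b / 1e9:.0f} GFLOPs/token")
print(f"MoE, 17B active: {moe_17b / 1e9:.0f} GFLOPs/token")
print(f"ratio          : {dense_70b / moe_17b:.1f}x")
```

Under this approximation, per-token compute depends only on the active count, which is why total capacity can grow without a matching growth in FLOPs.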

Scout vs Maverick: same active, different total

Per Meta's release and Hugging Face documentation:

  • Llama 4 Scout — 17B active parameters, 16 experts, 109B total parameters
  • Llama 4 Maverick — 17B active parameters, 128 experts, 400B total parameters

Both models pick a small number of experts per token, so per-token compute is similar. The differences are in total knowledge capacity and routing dynamics.

    Why same active params can give different quality

    Maverick's 128 experts allow finer specialization — each expert can carve out a narrower slice of the input distribution. Scout's 16 experts are broader. On most reasoning and instruction-following benchmarks, Maverick leads Scout. The tradeoff is that Maverick's 400B total parameters need to live somewhere — typically multiple GPUs.

    The memory tradeoff teams miss

    This is the most frequently misunderstood part of MoE deployment.

    "MoE saves FLOPs (compute), not memory footprint."

    Even if only 17B parameters compute on each token, all the experts need to be in GPU memory — otherwise routing requests to off-loaded experts costs you a CPU↔GPU transfer per token, which destroys throughput. So:

  • Scout (109B total, ~17B active): fits on a single GPU node, but the precision matters. A single 80 GB H100 works only with roughly 4-bit quantization (about 55 GB of weights), while 8-bit weights (about 109 GB) need two 80 GB GPUs, such as a dual-A100 setup.
  • Maverick (400B total, ~17B active): needs a multi-GPU node even when quantized. Expert parallelism across GPUs is the standard pattern.

    The deployment question is rarely "how many active params?" It is "do I have the GPU memory to hold all experts hot?"
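The sizing logic above reduces to simple arithmetic on total parameters. The following is a minimal sketch, assuming 80 GB GPUs and counting weights only; the helper names are hypothetical, and real deployments also need room for the KV cache, activations, and framework overhead, so actual requirements are higher.

```python
import math


def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory to hold the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(total_params: float, bits_per_param: int,
                         gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(total_params, bits_per_param) / gpu_memory_gb)


# Total (not active) parameter counts, taken from the figures in this article.
MODELS = {"Scout": 109e9, "Maverick": 400e9}

for name, total in MODELS.items():
    for bits in (16, 8, 4):
        gb = weight_memory_gb(total, bits)
        gpus = min_gpus_for_weights(total, bits)
        print(f"{name:9s} {bits:2d}-bit: {gb:6.1f} GB -> at least {gpus} x 80 GB GPU")
```

Note that the 17B active figure never appears in the calculation: memory is driven entirely by the total count.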

    Routing: what the gating function actually does

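    Each MoE layer has a small router network (often a single linear layer plus softmax). For every token, the router scores all experts and picks the top-k; many MoE designs use top-2, while Meta's release describes Maverick's MoE layers as sending each token to a shared expert plus one routed expert. The selected experts run their feed-forward computation; the others stay idle for that token.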
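The selection step described above can be sketched in a few lines of plain Python. This is an illustration of generic top-k gating, not Meta's implementation: the function names are hypothetical, the logits are made up, and renormalizing the chosen experts' weights to sum to 1 is one common convention among several.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so they sum to 1 (an assumed convention).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Router logits for one token over 4 hypothetical experts.
logits = [0.1, 2.0, -1.0, 1.5]
chosen = route_token(logits, k=2)
print(chosen)  # experts 1 and 3 are selected; experts 0 and 2 stay idle
```

The token's output is then the weighted sum of the chosen experts' outputs, so only `k` expert feed-forward passes run per token regardless of how many experts exist.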

    Two important properties:

  • Routing is learned during training. Experts specialize organically — some end up handling math-heavy tokens, others code, others natural language. You don't manually assign domains.
  • Load balancing is enforced. Without an auxiliary loss, the router would collapse to a few favorite experts and waste the rest. Modern MoE training adds a load-balancing loss to keep all experts utilized.
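To make the load-balancing idea concrete, here is a minimal sketch of one widely used auxiliary loss, in the style introduced by the Switch Transformer: the number of experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). The source does not say which formulation Llama 4 used, so treat this as an example of the technique rather than Meta's recipe; the function name and sample values are hypothetical.

```python
from typing import List


def load_balancing_loss(router_probs: List[List[float]]) -> float:
    """Auxiliary loss N * sum_i(f_i * P_i) over a batch of tokens.

    router_probs[t][i] is the router's softmax probability of expert i for token t.
    f_i = fraction of tokens whose top-1 choice is expert i
    P_i = mean router probability assigned to expert i
    Equals 1.0 when tokens and probability mass are spread uniformly.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / num_tokens
        for i, prob in enumerate(probs):
            p[i] += prob / num_tokens
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = [[0.7, 0.3], [0.3, 0.7]]   # tokens split evenly across 2 experts
collapsed = [[0.9, 0.1], [0.8, 0.2]]  # every token prefers expert 0

print(f"balanced : {load_balancing_loss(balanced):.2f}")
print(f"collapsed: {load_balancing_loss(collapsed):.2f}")
```

The collapsed batch scores higher than the balanced one, so adding this term to the training loss (scaled by a small coefficient) pushes the router away from favoring a few experts.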

    Multimodality and 10M context

    Llama 4 is also Meta's first natively multimodal Llama line: Scout and Maverick both ingest text and images out of the box. Scout's headline feature is its 10M-token context window, at release the longest advertised by any open-weight or proprietary model. Maverick supports 1M tokens.

    Two practical caveats:

  • A 10M-token window does not mean full-accuracy recall across 10M tokens. Independent testing of long-context models in 2026 shows accuracy decays well before the advertised maximum. Scout still leads the field on long-context tasks, but it works best behind a retrieval pre-filter rather than as a "stuff the whole codebase in" call.
  • Memory cost of context scales with sequence length. Even a model that fits comfortably for 10K tokens can OOM on a 1M-token prompt. Plan for KV-cache memory separately from model weights.
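The KV-cache point above is easy to quantify: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses a hypothetical model shape chosen only for illustration (it is not presented as Llama 4's published configuration) and assumes full attention in every layer with a 16-bit cache; architectures that use local or chunked attention, or a quantized cache, would need less.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence.

    The leading factor of 2 covers keys and values; the rest is
    layers x KV heads x head dimension x bytes per value x tokens.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical shape for illustration only.
SHAPE = dict(num_layers=48, num_kv_heads=8, head_dim=128)

for tokens in (10_000, 1_000_000, 10_000_000):
    gb = kv_cache_bytes(tokens, **SHAPE) / 1e9
    print(f"{tokens:>10,} tokens: {gb:8.1f} GB of KV cache")
```

Because the cost is linear in sequence length, a prompt 100 times longer needs 100 times the cache, independent of how many experts or parameters the model has.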

    Training details

    Per Meta's release, Scout was pretrained on roughly 40 trillion tokens of multimodal data and Maverick on roughly 22 trillion. Both were trained on a mix of publicly available, licensed, and Meta-product data. Notably, Scout, the smaller of the two, saw more training tokens, which is consistent with the current pattern of training smaller models longer for better per-parameter quality.
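One way to see the "smaller model, trained longer" pattern is to divide training tokens by parameter count. The sketch below does that using only the figures quoted in this article; the helper name is hypothetical, and tokens-per-parameter is a coarse indicator rather than a quality metric.

```python
def tokens_per_param(training_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return training_tokens / params


# Figures as quoted in this article.
RUNS = {
    "Scout": {"tokens": 40e12, "total": 109e9, "active": 17e9},
    "Maverick": {"tokens": 22e12, "total": 400e9, "active": 17e9},
}

for name, r in RUNS.items():
    per_total = tokens_per_param(r["tokens"], r["total"])
    per_active = tokens_per_param(r["tokens"], r["active"])
    print(f"{name:9s} {per_total:6.0f} tokens per total param, "
          f"{per_active:6.0f} per active param")
```

By total parameters, Scout saw several times more tokens per parameter than Maverick.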

    When MoE is right (and when it isn't)

    Pick MoE when:

  • You want frontier-grade quality with sub-frontier-cost inference
  • You have GPU memory to spare for total params, not just active
  • Your traffic mix is diverse (the router benefits from variety)

    Pick dense when:

  • You're deploying on a single small GPU and total params dominate cost
  • You need very predictable latency (MoE routing adds some variance)
  • Your fine-tuning workflow assumes the simpler dense gradient path

    The 2026 industry trend is unambiguous: above ~100B parameters, almost every flagship model is now MoE. DeepSeek V4, Qwen 3.5, Mistral Large 3, and Llama 4 all sit on the MoE side. Mistral Medium 3.5 is the notable counterexample at 128B dense, deliberately positioning against the MoE pattern.

    How to think about Llama 4 specifically

    If you're picking between Scout and Maverick:

  • Scout is the practical choice when you need long context, on-prem deployment, or single-node serving. It loses some quality vs. Maverick but gains operational simplicity.
  • Maverick is the practical choice when quality is paramount and you have multi-GPU infrastructure. The 128-expert routing produces stronger results across most benchmarks.

    Either way, you're now operating an MoE, and its lessons (memory budgeting, expert-parallel sharding, KV-cache scaling) are the new baseline for serving open-weight models at scale.

    Frequently Asked Questions

    What's the difference between Llama 4 Scout and Maverick?
    Scout has 16 experts, 109B total parameters, and a 10M-token context window, which makes it the better fit for long-context and single-node deployment. Maverick has 128 experts, 400B total parameters, and a 1M-token context window; it offers better quality but requires multi-GPU memory. Both have ~17B active parameters per token.

    Does MoE save GPU memory?
    No. MoE saves compute (FLOPs) per token because only a few experts activate, but all experts must be loaded in GPU memory simultaneously. Total parameters drive memory cost; active parameters drive compute and latency.

    Is Llama 4 Scout's 10M context window actually usable?
    It works mechanically, and Scout leads long-context benchmarks among open-weight models. However, recall accuracy decays before the advertised maximum, especially when relevant facts are buried mid-context. For most production workloads, combining a long context with a retrieval step gives more reliable results than relying on the full 10M.
