The GPU Scarcity Story: H100, H200, and B200

CallMissed
·5 min readArticle

"GPU shortage" was the defining infrastructure story of 2023 and 2024. By 2026 the story has shifted — but it has not gone away. Hopper-generation supply has loosened, Blackwell is ramping but constrained, and the gap between "what you can buy on a credit card" and "what you can buy with a multi-year commit" has only widened. Here is the picture as of mid-2026.

H100: from scarce to soft

The H100, NVIDIA's 2022-launched Hopper flagship, drove most of the 2023–2024 scarcity headlines. By 2026 it is widely available across cloud providers, with on-demand rates falling steadily as Blackwell capacity comes online. Spot and preemptible H100 pricing has slid first; on-demand list prices have followed. (Spheron blog) [Unverified — directional]

For most production inference workloads under ~70B parameters, H100 is now the value tier. It is plentiful, well-supported by every framework, and prices keep softening.

H200: the quiet workhorse

The H200 is essentially an H100 die paired with 141 GB of HBM3e memory and higher memory bandwidth. (NVIDIA H200 page) For LLM inference — which is bandwidth-bound, not compute-bound — the H200's larger HBM is often more useful than B200's raw FLOPS for models in the 70B–200B range.

H200 supply caught up faster than H100 ever did, and as B200 ramps the H200 market is softening too. For a team that wants Hopper headroom without B200 lead times, H200 is the practical pick in 2026. [Inference]

B200: shipping but constrained

The Blackwell B200 began shipping in late 2025 and through 2026 hyperscalers (AWS, GCP, Azure, Oracle) are rolling out instances. (gpu.fm guide)

The mechanical reasons B200 is hard to get:

  • Dual-die GB100 design — two large dies on TSMC's 4NP node, each with non-trivial yield drop on a large reticle.
  • CoWoS-L packaging capacity at TSMC is the bottleneck, not just wafers.
  • HBM3e supply (SK Hynix dominant, Micron and Samsung qualifying) is a separate constraint.
  • One reported figure — an estimated 3.6 million unit backlog as of April 2026 — captures the gap between order book and shipments. (gpu.fm) [Unverified — single-source estimate]

    For builders, the practical implication: cloud rental is the only realistic path to B200 inside a 30–60 day window. Buying hardware through OEMs (Supermicro, Dell, HPE) is a multi-quarter commitment, often gated by minimum-purchase agreements.

    What builders actually buy in 2026

    [Inference] Based on cloud-provider pricing pages and reported deal patterns:

    TierWorkloadPractical pick
    Inference, ≤70B modelMost production servingH100 (value) or H200 (memory headroom)
    Inference, 70B–200BLarger open models, MoEH200 or B200 (when available)
    Inference, 200B+ frontierSelf-hosted Llama-classB200 / GB200 NVL72 (rented)
    Training, fine-tuneLoRA, small SFTH100 or A100 — cheap and plentiful
    Training, frontierPre-training new modelsB200 / GB200 — multi-year commits

    Alternatives: AMD, Trainium, TPU

    The "NVIDIA-only" world is loosening at the edges:

  • AMD MI300X / MI325X — 192–256 GB HBM, attractive for very-large-model inference. ROCm support has matured but ecosystem is still narrower than CUDA.
  • AWS Trainium 2 / Inferentia 2 — competitive on $/inference for Bedrock-served workloads, especially Anthropic models on Trainium clusters.
  • Google TPU v5p / v5e / Trillium — strong for Gemini-class inference, available via Vertex AI.
  • Groq, Cerebras, SambaNova — specialized inference hardware. Groq leads on absolute tokens-per-second for select models. [Unverified]
  • For most teams in 2026, NVIDIA is still the default. The alternatives are credible enough that single-vendor lock-in is the more interesting risk than "alternatives don't work."

    Practical advice

  • Don't chase Blackwell unless you need it. H100 and H200 cover the bulk of production inference, and they are cheaper per token than B200 today even where B200 is available.
  • Spot tier is the cost lever. Spot/preemptible H100 is 40–60% off on-demand on most clouds. If your workload tolerates restart, it is the highest-leverage saving available.
  • Reserve only for predictable load. 1- and 3-year reservations get 30–60% off, but on a workload that may shift to a different hardware tier in 12 months you can over-commit yourself into a Blackwell-shaped problem.
  • Watch HBM, not chips. The actual scarcity in 2026 is not silicon — it is HBM3e packaging. That is what gates B200 supply, and it is what will gate B100 / H200 prices recovering. [Inference]
  • Bottom line

    The 2023 narrative — "you cannot get GPUs" — has become 2026's "you cannot get the newest GPUs at retail timelines." For most inference workloads, that does not matter. Hopper-tier is plentiful and cheap-er, and Blackwell rental is widely available even if Blackwell purchase is not. Plan capacity around what you can actually buy in your decision window, not around the press release.

    Frequently Asked Questions

    Is the H100 still worth buying in 2026?
    For inference workloads under ~70B parameters, the H100 is the price-performance sweet spot. The H200 is preferable when memory headroom matters; B200 is preferable when raw throughput at scale matters and you can secure capacity.
    How long is the wait for B200?
    Cloud rental is reportedly available now from early-allocation hyperscalers; direct hardware purchase via OEMs is generally a multi-quarter timeline gated by minimum purchase commitments. [Inference]
    Should I consider AMD or AWS Trainium as an alternative?
    For inference, both are credible in 2026, especially for cost-sensitive workloads. The ecosystem is narrower than CUDA, but if your workload runs on Bedrock or Vertex you may already be using non-NVIDIA silicon without thinking about it.

    Related Posts