AI Inference Cost Optimization: Practical Wins

CallMissed
·6 min readGuide
Cover image: AI Inference Cost Optimization: Practical Wins
Cover image: AI Inference Cost Optimization: Practical Wins

The first AI bill is small. The second is a surprise. The third is a meeting. By 2026 most production AI workloads have left the toy budget behind, and the gap between teams that "do something about cost" and teams that do not is now measured in factors of 5–10x. The good news: most of the wins come from a small handful of well-understood techniques.

1. Prompt caching: the highest-leverage move

1. Prompt caching: the highest-leverage move
1. Prompt caching: the highest-leverage move

Prompt caching reuses the prefill compute of a previously seen prompt prefix. If your system prompt is 4,000 tokens and you call it 10,000 times a day, you are paying full price for the same prefix 9,999 times.

Anthropic's prompt caching charges 1.25× normal input price on a 5-minute cache write and 0.10× on a cache read; the 1-hour cache is 2× write and 0.10× read. Breakeven is just two cache hits for the long-duration cache. (Anthropic docs)

OpenAI handles caching automatically with cache reads at roughly 0.25–0.50× input price depending on the model. (OpenRouter docs)

The pattern that ships: put your stable system prompt + few-shot examples + tool descriptions in the cached prefix; put the user's variable input outside the cache breakpoint. A reported case: $720/month → $72/month after caching was applied to a high-volume agentic workload. (Medium case study) [Unverified — anecdote]

2. Model cascading

2. Model cascading
2. Model cascading

Not every request needs the smartest model. Cascading routes simple queries to small models and only escalates when needed.

The classic FrugalGPT result reported 50–98% cost reduction while matching or exceeding GPT-4 accuracy by escalating only on low-confidence answers. (summary via aitoolsbusiness) [Unverified — research paper, results context-specific]

Practical patterns:

  • Tiered routing — Haiku/Mini for triage and classification, Sonnet/4o for reasoning, Opus/o1 for the hardest cases.
  • Confidence escalation — if the small model returns "I'm not sure" or low log-probabilities, retry on the bigger model.
  • Self-consistency check — small model answers twice; if answers diverge, escalate.
  • The right cascade depends on your workload. Measure escalation rate; if it sits above ~30%, the small model is too small for your task.

    3. Continuous batching (if you self-host)

    If you serve your own model, continuous batching is the single largest GPU-utilization win. Static batching forces every request in a batch to wait for the slowest one; continuous batching schedules new requests into the GPU as soon as a slot opens.

    vLLM's continuous batching is reported to lift GPU utilization from ~15–30% (naïve serving) to ~60–80%, with 3–4× higher effective throughput at the same GPU cost. (Hakia) [Unverified — synthesis from multiple practitioner sources]

    If you are on managed APIs, this happens for you. If you self-host, run vLLM, SGLang, or TensorRT-LLM with continuous batching turned on; do not roll your own scheduler.

    4. Smaller, fine-tuned models for hot paths

    For repetitive workloads — classification, extraction, summarization with a fixed schema — a small fine-tuned model often matches a frontier API at 5–20× lower cost.

    Pattern:

  • Run the frontier model on the workload for a week to generate ground-truth examples.
  • Fine-tune (LoRA on a 7B–14B open model, or hosted fine-tune on a small closed model) on those examples.
  • Route the workload to the fine-tuned model; keep the frontier model as fallback.
  • The economics flip when daily call volume crosses ~50K–500K depending on prompt size. [Inference]

    5. Output structure and length

    The cheapest token is the one you do not generate. Concrete tactics:

  • Structured output (JSON schema, function calling) replaces "explain in prose" outputs that bloat token counts 3–5x.
  • max_tokens discipline — set it to the actual ceiling, not the model max.
  • Stop sequences — terminate generation as soon as the answer is complete.
  • Compression in chains — for multi-step pipelines, summarize intermediate steps before passing them forward.
  • 6. Observability is the prerequisite

    You cannot optimize what you cannot see. Log per-request:

  • Model used, prompt tokens, completion tokens, cached tokens, total cost
  • Latency (TTFT, total)
  • Tenant / user / feature attribution
  • Cache hit/miss
  • Roll those up by feature and by model. The 80/20 will surface within a week — usually one feature is 60% of cost, and one prompt template is 40% of that feature's tokens. Optimize the head of the distribution; ignore the tail.

    7. Batch APIs for non-realtime work

    Both OpenAI and Anthropic offer batch APIs at roughly 50% off for asynchronous workloads with a longer SLA (typically up to 24 hours). For overnight enrichment, embeddings backfill, eval runs, and offline analytics, batch is a nearly free 2× discount you should be using by default for everything that does not need a synchronous response. [Unverified — pricing accurate as of early 2026]

    A worked example

    A worked example
    A worked example

    A team running a customer-support copilot at $14,000/month:

    StepSavingNew monthly
    Baseline (GPT-4-class for everything)$14,000
    Cascade triage to Haiku/Mini35%$9,100
    Prompt caching for system prompt40%$5,460
    Tighten output schema, drop max_tokens15%$4,640
    Move analytics enrichment to batch API20%$3,710

    [Speculation] Numbers above are illustrative — your mileage will vary based on workload mix — but the order of wins (cascade → caching → output discipline → batch) is consistent across the teams I've seen reported.

    The real risk in 2026: agentic compounding

    Agents call the model in loops. A single user request can trigger 20–50 model calls. If you do not track tokens per task (not just per call), an agentic feature can quietly become your largest cost line item between billing cycles. Set per-task token budgets. Alert when they breach.

    Bottom line

    The cheapest AI workload is one where caching covers the static part of your prompt, the right-sized model covers the easy queries, the frontier model covers the hard ones, the output schema is tight, and batch APIs absorb anything that can wait an hour. Everything else is variations on those five themes.

    Frequently Asked Questions

    How much can prompt caching actually save?
    For workloads with a long stable system prompt called many times per minute, 50–90% of input cost is realistic. The math depends entirely on your cache hit rate.
    Should I self-host to save money?
    Usually no, until volume crosses ~50–200M tokens per month. Below that threshold, managed APIs are cheaper after engineering and ops cost. Above it, the math starts to favor self-hosting if you have the team.
    What's the single biggest mistake teams make?
    Routing every request to the frontier model. A two-tier cascade (small for triage, large for reasoning) typically cuts cost 30–60% with negligible quality loss when the small model is competent at your task.

    Related Posts