The single most expensive line in any agent product is the bill from the day a loop ran free. Not the slow accumulation of normal usage — the one Tuesday when a tool retry got into a state where a single conversation called the model 412 times and burned through what was supposed to be a month of margin. Cost budgeting is not a finance concern in agent systems; it is a production-reliability primitive.
Why agents go off the rails
A few common failure modes:
Retry loops. A tool returns a malformed result the model can't parse; the model retries the same tool with the same arguments; repeat.
Looping on missing context. The model decides it needs to re-read 30 documents on every turn instead of remembering what it read last turn.
Pathological self-critique. A generator-critic loop where the critic is too pessimistic and the generator never satisfies it.
Context bloat compounding. Each turn the prompt gets longer, each turn costs more, and the agent doesn't notice.
Bad eval sets in CI. A dev-mode loop runs against the API, not a stub, and quietly burns the corporate card overnight.
None of these are model bugs. They are system bugs that cost money because models are metered.
The four budget layers
A reasonable agent budget stack has four levels:
Per-call cap — max tokens, max output, max model size for this single LLM call
Per-step cap — total tokens across one tool invocation + LLM round-trip
Per-conversation cap — total dollars across a full session
Per-tenant cap — total dollars across a billing period for one customer
Each is enforced at a different point. Per-call by your model client. Per-step by your tool harness. Per-conversation by the agent loop. Per-tenant by your billing layer.
Per-call: max_tokens is not enough
Most engineers cap max_tokens and call it done. That bounds the output; it does nothing about the input. A 200K-token context with max_tokens=512 is still an expensive call.
Better:
Cap input tokens explicitly. Truncate or summarize before the call if you're over budget.
Cap output tokens to the smallest value that still fits the task.
Pick the smallest model that meets quality. A Haiku call that's 90% as good as Sonnet at 1/10 the cost is usually the right answer for routing, classification, and short-form work.
Use prompt caching aggressively for stable system prompts and large repeated context. Anthropic's prompt cache and OpenAI's cached input both deliver 50–90% cost reduction on cache hits. [Inference]
Per-step: bound the loop iteration
A single agent step (LLM call + tool invocation + result back to LLM) should have an enforced ceiling. The pattern that holds up:
Hard token cap (input + output) per step
Hard wall-clock cap per step (e.g., 30s)
On exceed → return a typed error to the parent agent loop, not a retry
Without per-step caps, one slow tool call cascades into a multi-minute timeout that costs the user latency and costs you tokens for the regenerated context.
Per-conversation: the user-facing budget
This is where most $100 loops are caught:
Track cumulative cost on the conversation object
After each step, check if cost > budget: stop
Surface the budget to the user when they're close ("This conversation is approaching its cost budget; would you like to continue?")
For autonomous tasks, stop hard. Don't ask, just halt and report.
A reasonable budget for a chat-style agent is $0.10–$0.50 per conversation. For autonomous coding agents it can be $5–$20 per task. Whatever the number, it should be a config, not a hard-coded constant, and it should be plumbed everywhere.
Per-tenant: the billing fence
Per-tenant caps catch the case where one customer's runaway agents would blow up your unit economics. Three components:
Soft cap — alert when 80% of monthly budget is consumed
Hard cap — refuse new agent runs above 100%
Burst cap — refuse if cost-per-hour exceeds a threshold even within the monthly budget
The burst cap is what you wish you had on the day a customer's misconfigured cron job DDOS'd your bill.
Exponential backoff is not optional
When a model API returns a 429 or 5xx, the wrong reaction is "retry immediately." The right reaction is exponential backoff with jitter:
1st retry: wait 1–2s
2nd retry: wait 2–4s
3rd retry: wait 4–8s
After 3 retries: give up, return an error
Without backoff, a transient provider issue becomes a self-inflicted thundering herd that costs you tokens for retries the provider was already failing.
Alerting
Three alerts cover the majority of cost incidents:
Sudden cost rate change. If the dollars-per-hour rate jumps 5× without a corresponding traffic increase, page someone.
Per-conversation cost outlier. If a single conversation costs more than 3× the p99 of recent history, log loudly and consider halting.
Token-per-turn drift. If average tokens per turn is climbing week over week without product changes, you have context bloat or a memory layer that's not pruning.
A starter checklist
If you ship an agent in production tomorrow, these are the non-negotiables:
[ ] Per-call max_tokens and max_input_tokens enforced
[ ] Per-step token + wall-clock ceiling
[ ] Per-conversation budget that hard-stops the loop
[ ] Per-tenant monthly budget with soft and hard caps
[ ] Exponential backoff with jitter on all model API calls
[ ] Cost rate alert on a dashboard
[ ] Per-conversation outlier alert
None of this is glamorous. All of it pays for itself the first time a loop tries to run free.
Frequently Asked Questions
What's a reasonable per-conversation budget for a chat agent?
[Inference] $0.05–$0.50 covers most chat-shaped agents in 2026, depending on context size and how aggressively you cache. Autonomous coding tasks run higher — $5–$20 per task is a common ceiling.
Should I cap by tokens or by dollars?
Cap by both. Tokens are deterministic and cheap to track in the loop; dollars are what your finance team cares about. Use tokens for tactical per-call enforcement, dollars for conversation- and tenant-level budgets.
How do I budget when costs vary 10× between models?
Pick the smallest model that meets quality on your eval, route higher-cost models only when the smaller one fails, and always log the model used per call so you can attribute cost shifts. Dynamic routing without observability is how cost runaways start.