Rate Limiting AI APIs: Strategies That Actually Work
Rate limiting an AI API is harder than rate limiting a regular API. A "request" can cost $0.0001 or $5.00 depending on prompt size, model, and output length. A noisy tenant can starve a paying tenant. An agent loop can fire 100 model calls per user action. The "100 requests per minute" rules from REST-API days do not survive contact with this workload.
Why traditional rate limiting fails for AI
A single AI agent request can cost 100× more than a typical human request, yet traditional rate limiters treat them all the same. (Zuplo on AI rate limiting)
The provider rate-limit metrics that matter in 2026:
OpenAI, Anthropic, and most providers publish all four. TPM (tokens per minute) is usually the binding constraint, not RPM (requests per minute), because token cost dwarfs per-request cost.
The four core algorithms
Token bucket
Each client has a "bucket" that fills at a fixed rate. Each request consumes tokens. Empty bucket → 429. The classic implementation; allows bursts up to the bucket size.
    bucket_size = 1000     # max burst
    refill_rate = 100/sec  # steady state

Token bucket "often strikes the best balance, handling bursts effectively while maintaining overall traffic control." (api7.ai)
For AI APIs, the tokens in the algorithm should be input/output tokens, not request count. A 50K-token prompt consumes 50K tokens from the bucket; a 500-token prompt consumes 500.
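A minimal sketch of a token-weighted bucket, assuming a single process and in-memory state. The class name `TokenBucket` and the optional `now` argument (included only so the example is deterministic) are illustrative and not taken from any particular library.

```python
import time


class TokenBucket:
    """Token bucket whose units are model tokens rather than requests."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def _refill(self, now):
        # Add the tokens earned since the last update, capped at the bucket size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now

    def try_consume(self, cost, now=None):
        """Deduct `cost` tokens if available; False means the caller should send 429."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

With `capacity=1000` and `refill_per_sec=100`, two 500-token prompts pass back to back, a third is refused, and 100 tokens are available again one second later. Note that a prompt larger than `capacity` can never be admitted by this sketch, so the bucket size has to be at least the largest request you intend to allow.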
Leaky bucket
Requests enter a queue at any rate; the queue drains at a fixed rate. Smooths bursts; can introduce latency for queued requests.
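One way to sketch the leaky bucket is as a scheduler: each arriving request is assigned the next evenly spaced release slot, and is refused once the implied wait grows too long. The names `LeakyBucket`, `schedule`, and `max_delay_sec` are assumptions made for this example.

```python
class LeakyBucket:
    """Leaky bucket as a scheduler: requests are released at a fixed rate."""

    def __init__(self, drain_per_sec, max_delay_sec):
        self.interval = 1.0 / drain_per_sec  # spacing between released requests
        self.max_delay_sec = max_delay_sec   # longest wait we are willing to impose
        self.last_release = float("-inf")    # nothing has been released yet

    def schedule(self, now):
        """Return the time a request arriving at `now` may run, or None if the queue is too long."""
        release_at = max(now, self.last_release + self.interval)
        if release_at - now > self.max_delay_sec:
            return None  # queue is full; the caller rejects instead of waiting
        self.last_release = release_at
        return release_at
```

The gap between `now` and the returned time is the queueing latency the text mentions; `max_delay_sec` bounds it.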
Fixed window
Count requests in the current minute/hour. Simple but vulnerable to edge bursts: 100 requests at 12:00:59.5 plus 100 at 12:01:00.5 puts 200 requests through in one second, even with a "100/min" limit.
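The edge burst is easy to reproduce. The sketch below is an illustrative in-memory counter (the class name `FixedWindowCounter` is an assumption); timestamps are seconds, so 59.5 and 60.5 fall on either side of a minute boundary.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Counts requests per client in fixed, clock-aligned windows."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)  # (client, window index) -> requests seen

    def allow(self, client, now):
        window = int(now // self.window_sec)  # which window this timestamp falls in
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowCounter(limit=100, window_sec=60)
before = sum(limiter.allow("tenant-a", 59.5) for _ in range(100))
after = sum(limiter.allow("tenant-a", 60.5) for _ in range(100))
admitted = before + after  # all 200 pass, one second apart
```

Old window entries are never evicted here; a real implementation would expire them.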
Sliding window
Same idea as fixed window, but the window rolls forward with each request instead of resetting on a clock boundary. Smoother behavior at window edges. The default for most modern systems.
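A sliding window can be sketched as a log of recent timestamps. This is one of several variants (a weighted two-counter approximation is another) and is chosen here only because it is the shortest to read; `SlidingWindowLog` is an illustrative name.

```python
from collections import deque


class SlidingWindowLog:
    """Admits a request only if fewer than `limit` were admitted in the last `window_sec`."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.events = deque()  # timestamps of admitted requests, oldest first

    def allow(self, now):
        # Drop timestamps that have aged out of the rolling window.
        while self.events and self.events[0] <= now - self.window_sec:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True
```

With a limit of 100 per 60 seconds, 100 requests at t=59.5 fill the window, so a request at t=60.5 is refused; capacity returns only once those entries are 60 seconds old. The cost is memory: one timestamp per admitted request.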
Per-tenant fairness
In multi-tenant systems, per-IP rate limiting is wrong. Two customers behind the same NAT share the limit; one tenant on multiple machines bypasses it.
The right unit is per-tenant (or per-API-key, per-org, per-user — pick the granularity that matches billing). Each tenant gets their own bucket. (dreamfactory on multi-tenant)
A practical implementation:
    # pseudocode
    bucket_key = f"rate_limit:{tenant_id}:tpm"
    allowed = redis_token_bucket(
        key=bucket_key,
        capacity=tier_limit(tenant.plan),
        refill_per_sec=tier_limit(tenant.plan) / 60,
        cost=estimated_input_tokens,
    )
    if not allowed:
        return 429

Tier the limits by plan: free 10K TPM, starter 100K TPM, pro 1M TPM, enterprise custom.
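To make the per-tenant isolation concrete, here is a runnable in-memory stand-in for the check above. It assumes a single process; `TIER_TPM`, `allow_request`, and the `now` argument are names invented for this sketch, and the enterprise tier is left out because its limit is negotiated per customer.

```python
import time

# Tier limits from the text, in tokens per minute.
TIER_TPM = {"free": 10_000, "starter": 100_000, "pro": 1_000_000}

_buckets = {}  # tenant_id -> (tokens remaining, time of last update)


def allow_request(tenant_id, plan, estimated_input_tokens, now=None):
    """One token bucket per tenant; returns False when the caller should send 429."""
    now = time.monotonic() if now is None else now
    capacity = TIER_TPM[plan]
    refill_per_sec = capacity / 60
    tokens, updated = _buckets.get(tenant_id, (capacity, now))
    # Refill for the elapsed time, capped at the tier's capacity.
    tokens = min(capacity, tokens + (now - updated) * refill_per_sec)
    allowed = estimated_input_tokens <= tokens
    if allowed:
        tokens -= estimated_input_tokens
    _buckets[tenant_id] = (tokens, now)
    return allowed
```

Because the bucket is keyed by tenant, one free-tier tenant exhausting its 10K TPM has no effect on another tenant's balance.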
Estimating cost before the request
The unique problem with AI APIs: you do not know the exact cost until after the response. You know the input tokens, but output tokens are bounded only by max_tokens.
Two approaches:
1. Reserve max_tokens: reserve input + max_tokens from the bucket; refund the difference after the response. Conservative; wastes capacity on short responses.
2. Charge input upfront, debit actual_output after. May briefly exceed limits if many requests have surprisingly long responses simultaneously.

Most teams pick option 2 with a smaller secondary check on cumulative TPM over the last 60 seconds. [Inference]
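The two accounting schemes can be compared side by side. This is a sketch under simplifying assumptions: refill is omitted, and `TokenBudget` and the four helper functions are hypothetical names, not part of any provider SDK.

```python
class TokenBudget:
    """Per-tenant token balance; refill is left out to keep the sketch short."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.available = capacity

    def reserve(self, amount):
        if amount > self.available:
            return False
        self.available -= amount
        return True

    def refund(self, amount):
        self.available = min(self.capacity, self.available + amount)

    def debit(self, amount):
        # May push the balance below zero; later requests are refused until it recovers.
        self.available -= amount


def admit_reserving_max(budget, input_tokens, max_tokens):
    """Option 1: hold input + max_tokens before the call."""
    return budget.reserve(input_tokens + max_tokens)


def settle_reserving_max(budget, max_tokens, actual_output):
    """Option 1: give back the unused part of the reservation."""
    budget.refund(max_tokens - actual_output)


def admit_charging_input(budget, input_tokens):
    """Option 2: charge only what is known before the call."""
    return budget.reserve(input_tokens)


def settle_charging_input(budget, actual_output):
    """Option 2: debit the real output once the response is in."""
    budget.debit(actual_output)
```

For a 1,000-token prompt with `max_tokens=4000` and a 500-token reply, both options end with the same balance. The difference is that option 1 holds an extra 3,500 tokens while the call is in flight, and option 2 can go negative if many long replies land at once.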
Queue-based throttling
When a tenant hits the limit, two responses are possible:
1. Reject: return 429 with a Retry-After header. The client backs off and retries.
2. Queue: hold the request and run it once capacity frees up.

For interactive UIs (a user typing in chat), reject is usually correct: the user should know there is a problem. For background batch jobs, queueing is reasonable.
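A sketch of the reject-or-queue decision, assuming the limiter is a token bucket so the wait can be derived from the token deficit and the refill rate. The function names and the `interactive` flag are illustrative assumptions; how a request is classified as interactive is left to the caller.

```python
import math


def retry_after_seconds(cost, available, refill_per_sec):
    """Whole seconds until the bucket holds `cost` tokens again (at least 1)."""
    deficit = max(0.0, cost - available)
    return max(1, math.ceil(deficit / refill_per_sec))


def throttle_decision(interactive, cost, available, refill_per_sec):
    """Return (action, wait) where action is 'allow', 'reject', or 'queue'."""
    if cost <= available:
        return ("allow", None)
    wait = retry_after_seconds(cost, available, refill_per_sec)
    if interactive:
        return ("reject", wait)  # respond 429 with "Retry-After: <wait>"
    return ("queue", wait)       # hold the job and re-check after <wait> seconds
```

The value is rounded up to whole seconds because the Retry-After header's delay form is an integer number of seconds.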
429 handling on the client side
The client behavior matters as much as server behavior. Best practices:
- Honor Retry-After if the server sends it. Do not retry sooner.
- Use exponential backoff with jitter when there is no Retry-After: delay = base * 2^attempt + random(0, jitter). Jitter is critical to avoid synchronized retries that immediately re-spike the limiter.

A common bug: a client retries, but the original request did succeed (network blip on the response), and the action runs twice. Idempotency keys are the fix.
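The client-side rules can be sketched as follows. `send` is a hypothetical callable standing in for the HTTP call; it is assumed to accept an idempotency key and return `(status, retry_after, body)`. The default values for `base` and `jitter` are arbitrary.

```python
import random
import time
import uuid


def backoff_delay(attempt, retry_after=None, base=0.5, jitter=0.25):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        return float(retry_after)  # never retry sooner than the server asked
    return base * 2 ** attempt + random.uniform(0, jitter)


def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Retry on 429, reusing one idempotency key so the server can deduplicate."""
    key = str(uuid.uuid4())  # one key per logical request, not per attempt
    for attempt in range(max_attempts):
        status, retry_after, body = send(key)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, retry_after))
    return 429, None
```

Generating the key outside the loop is the important detail: if each retry carried a fresh key, the server could not tell a retry from a new action.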
Observability for rate limits
Log per request:
Set alerts on:
Implementation at scale
State is the problem. A single-process token bucket is trivial; coordinating across 50 server instances is not. Common patterns (gravitee on scale):
- Redis (Lua scripts or the Redis Cell extension): atomic bucket updates from any instance

For most teams, Redis with token-bucket Lua scripts is the default. It scales to many tens of thousands of bucket updates per second and is well-understood operationally.
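A minimal sketch of the Redis-plus-Lua pattern, assuming the redis-py client, whose `register_script` returns a callable taking `keys` and `args`. The hash field names (`tokens`, `ts`), the expiry policy, and passing the timestamp from the application are all choices made for this example, not a canonical script.

```python
import time

# The script runs atomically inside Redis, so two instances cannot spend the same tokens.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now
tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_per_sec)
local allowed = 0
if cost <= tokens then
  tokens = tokens - cost
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
redis.call('EXPIRE', KEYS[1], math.ceil(2 * capacity / refill_per_sec))
return allowed
"""


def make_redis_token_bucket(client):
    """`client` is assumed to be a redis-py Redis instance (anything with register_script)."""
    script = client.register_script(TOKEN_BUCKET_LUA)

    def redis_token_bucket(key, capacity, refill_per_sec, cost, now=None):
        # Wall-clock time from the caller; clock skew between instances is ignored here.
        now = time.time() if now is None else now
        return script(keys=[key], args=[capacity, refill_per_sec, cost, now]) == 1

    return redis_token_bucket
```

Usage would look like `limiter = make_redis_token_bucket(redis.Redis())` followed by `limiter("rate_limit:t1:tpm", 100_000, 100_000 / 60, 1_200)`. The expiry lets idle tenants' keys disappear; an absent key is treated as a full bucket.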
Bottom line
Rate limiting AI APIs in 2026 is token-aware, per-tenant, and 429-friendly. Use token bucket on tokens (not requests), tier by plan, charge inputs upfront and reconcile outputs, and design clients to back off with jitter and idempotency keys. The result: noisy tenants do not starve quiet ones, the cost ceiling holds, and the system degrades gracefully under load instead of falling over.


