Rate Limiting AI APIs: Strategies That Actually Work

CallMissed · May 8, 2026 · 6 min read · Guide

Rate Limiting · API Design · Multi-Tenancy · Reliability

Rate limiting an AI API is harder than rate limiting a regular API. A "request" can cost $0.0001 or $5.00 depending on prompt size, model, and output length. A noisy tenant can starve a paying tenant. An agent loop can fire 100 model calls per user action. The "100 requests per minute" rules from REST-API days do not survive contact with this workload.

Why traditional rate limiting fails for AI

A single AI agent request can cost 100× more than a typical human request, yet traditional rate limiters treat them all the same. (Zuplo on AI rate limiting)

The provider rate-limit metrics that matter in 2026:

RPM — requests per minute

TPM — tokens per minute (input + output)

TPD — tokens per day

IPM — images per minute (for multimodal)

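Major providers such as OpenAI and Anthropic publish limits along several of these dimensions, though the exact set and naming vary by provider and model. TPM is usually the binding constraint, not RPM, because token volume, not request count, drives cost and capacity.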

The four core algorithms

Token bucket

Each client has a "bucket" that fills at a fixed rate. Each request consumes tokens. Empty bucket → 429. The classic implementation; allows bursts up to the bucket size.

Code

bucket_size = 1000   # max burst, in tokens
refill_rate = 100    # tokens added per second (steady state)

Token bucket "often strikes the best balance, handling bursts effectively while maintaining overall traffic control." (api7.ai)

For AI APIs, the tokens in the algorithm should be input/output tokens, not request count. A 50K-token prompt consumes 50K tokens from the bucket; a 500-token prompt consumes 500.
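To make the token-weighted variant concrete, here is a minimal single-process sketch. The class name `TokenBucket`, the method `try_consume`, and the injectable `now` argument are illustrative assumptions, not part of any provider's API; the numbers are arbitrary.

```python
import time


class TokenBucket:
    """Token bucket whose units are model tokens, not requests."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def try_consume(self, cost, now=None):
        """Refill for elapsed time, then deduct `cost` if it fits."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # the caller would answer with a 429


bucket = TokenBucket(capacity=60_000, refill_per_sec=1_000, now=0.0)
print(bucket.try_consume(50_000, now=0.0))  # True: the large prompt fits the burst budget
print(bucket.try_consume(50_000, now=1.0))  # False: only 11,000 tokens are available
print(bucket.try_consume(500, now=1.0))     # True: a small prompt still fits
```

Passing `now` explicitly keeps the example deterministic; in a real service the monotonic clock default would be used.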

Leaky bucket

Requests enter a queue at any rate; the queue drains at a fixed rate. Smooths bursts; can introduce latency for queued requests.
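A leaky bucket can be sketched as a bounded queue with a fixed release schedule. The names `LeakyBucket`, `offer`, and `drain`, and the idea of rejecting when the queue is full, are assumptions made for this illustration.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue that accepts bursts and releases items at a fixed rate."""

    def __init__(self, drain_per_sec, max_queue, now=0.0):
        self.interval = 1.0 / drain_per_sec  # seconds between releases
        self.max_queue = max_queue
        self.queue = deque()
        self.next_release = now

    def offer(self, item, now):
        """Enqueue an item; return False (reject) when the queue is full."""
        if len(self.queue) >= self.max_queue:
            return False
        if not self.queue:
            # Idle time must not turn into a later burst of releases.
            self.next_release = max(self.next_release, now)
        self.queue.append(item)
        return True

    def drain(self, now):
        """Release every queued item whose slot has come up by `now`."""
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released


lb = LeakyBucket(drain_per_sec=2, max_queue=3)
print([lb.offer(x, now=0.0) for x in "abcd"])  # [True, True, True, False]
print(lb.drain(now=0.0))                       # ['a']
print(lb.drain(now=1.0))                       # ['b', 'c']
```

The queued items `b` and `c` wait up to a second before release, which is the latency cost mentioned above.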

Fixed window

Count requests in the current minute/hour. Simple but vulnerable to bursts at window edges: 100 requests at 12:00:59.5 plus 100 at 12:01:00.5 all pass a "100/min" limit, even though that is 200 requests in about one second.
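The boundary problem is easy to reproduce. The sketch below uses a hypothetical `FixedWindowCounter` with timestamps in seconds, so 59.5 and 60.5 fall on either side of a minute boundary.

```python
class FixedWindowCounter:
    """Counts requests per fixed window; the counter resets at each boundary."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window_sec)
        if window_id != self.window_id:  # a new window started: reset
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindowCounter(limit=100, window_sec=60)
late = sum(limiter.allow(59.5) for _ in range(100))   # end of the first minute
early = sum(limiter.allow(60.5) for _ in range(100))  # start of the next minute
print(late + early)  # 200 requests accepted within about one second
```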

Sliding window

Same idea as fixed window, but the window rolls forward with each request instead of resetting on the clock. Smoother behavior at window edges. A common default in modern systems.
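One way to implement a rolling window is a sliding log of timestamps, sketched below under the assumption that per-client memory of up to `limit` entries is acceptable. The class name `SlidingWindowLog` is illustrative. Replaying a boundary burst (100 requests at t=59.5 s, then 100 at t=60.5 s) shows the second batch being refused.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps timestamps of accepted requests from the last `window_sec` seconds."""

    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.events = deque()

    def allow(self, now):
        cutoff = now - self.window_sec
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()  # drop entries that left the window
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False


limiter = SlidingWindowLog(limit=100, window_sec=60)
late = sum(limiter.allow(59.5) for _ in range(100))
early = sum(limiter.allow(60.5) for _ in range(100))
print(late, early)  # 100 0
```

A sliding window counter, which blends the counts of two adjacent fixed windows, trades a little accuracy for much lower memory.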

Per-tenant fairness

In multi-tenant systems, per-IP rate limiting is wrong. Two customers behind the same NAT share the limit; one tenant on multiple machines bypasses it.

The right unit is per-tenant (or per-API-key, per-org, per-user — pick the granularity that matches billing). Each tenant gets their own bucket. (dreamfactory on multi-tenant)

A practical implementation:

python

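# pseudocode: redis_token_bucket and tier_limit are placeholder helpers
tpm_limit = tier_limit(tenant.plan)        # tokens per minute for this plan
allowed = redis_token_bucket(
    key=f"rate_limit:{tenant.id}:tpm",     # one bucket per tenant
    capacity=tpm_limit,                    # burst of up to one minute's budget
    refill_per_sec=tpm_limit / 60,         # steady-state refill rate
    cost=estimated_input_tokens,           # charge known input tokens upfront
)
if not allowed:
    return 429  # ideally with a Retry-After header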

Tier the limits by plan: free 10K TPM, starter 100K TPM, pro 1M TPM, enterprise custom.
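The tier table can live in a small lookup. This is a sketch: the `TIER_TPM` name and the `custom_tpm` parameter for enterprise contracts are assumptions, and the numbers are the ones listed above.

```python
# Tokens-per-minute limits by plan, taken from the tiers listed above.
TIER_TPM = {
    "free": 10_000,
    "starter": 100_000,
    "pro": 1_000_000,
}


def tier_limit(plan, custom_tpm=None):
    """Return the TPM limit for a plan; enterprise uses a negotiated value."""
    if plan == "enterprise":
        if custom_tpm is None:
            raise ValueError("enterprise plans need a negotiated TPM limit")
        return custom_tpm
    return TIER_TPM[plan]  # raises KeyError for an unknown plan


print(tier_limit("starter"))       # 100000
print(tier_limit("starter") / 60)  # refill rate in tokens per second
```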

Estimating cost before the request

The unique problem with AI APIs: you do not know the exact cost until after the response. You know the input tokens, but output tokens are bounded only by max_tokens.

Two approaches:

Pre-charge by max_tokens — reserve input + max_tokens from the bucket; refund the difference after the response. Conservative; wastes capacity on short responses.

Charge by input only, then debit output — charge input upfront, debit actual_output after. May briefly exceed limits if many requests have surprisingly long responses simultaneously.

A common choice is option 2, backed by a secondary check on cumulative TPM over the last 60 seconds. (This is an inference from common practice, not a measured statistic.)
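The two accounting approaches differ only in when tokens leave the bucket. The sketch below uses a hypothetical non-refilling `TokenBudget` as a stand-in for a tenant's bucket so the arithmetic stays visible; all function names are illustrative.

```python
class TokenBudget:
    """Non-refilling stand-in for a tenant's bucket (refill omitted for brevity)."""

    def __init__(self, tokens):
        self.tokens = tokens


def precharge(budget, input_tokens, max_tokens):
    """Option 1: reserve input + max_tokens. Returns the reservation or None."""
    reserved = input_tokens + max_tokens
    if reserved > budget.tokens:
        return None
    budget.tokens -= reserved
    return reserved


def settle_precharge(budget, reserved, input_tokens, actual_output):
    """Option 1, after the response: refund the unused part of the reservation."""
    budget.tokens += reserved - (input_tokens + actual_output)


def charge_input(budget, input_tokens):
    """Option 2: charge only the known input cost upfront."""
    if input_tokens > budget.tokens:
        return False
    budget.tokens -= input_tokens
    return True


def debit_output(budget, actual_output):
    """Option 2, after the response: the balance may go briefly negative."""
    budget.tokens -= actual_output


a = TokenBudget(10_000)
reserved = precharge(a, input_tokens=2_000, max_tokens=4_000)
print(a.tokens)  # 4000 while the request is in flight
settle_precharge(a, reserved, input_tokens=2_000, actual_output=500)

b = TokenBudget(10_000)
charge_input(b, input_tokens=2_000)
print(b.tokens)  # 8000 while the request is in flight
debit_output(b, actual_output=500)

print(a.tokens, b.tokens)  # 7500 7500
```

Both end at the same balance; the difference is how much capacity is held back while the request is running, and whether the balance can overshoot.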

Queue-based throttling

When a tenant hits the limit, two responses are possible:

Reject with 429 + Retry-After header. The client backs off and retries.

Queue internally up to a max wait time, then either fulfill or 429. Smoother UX, but adds latency and operational complexity.

For interactive UIs (a user typing in chat), reject is usually correct — the user should know there is a problem. For background batch jobs, queueing is reasonable.
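The reject-or-queue decision can be expressed as a wait-time calculation against the bucket's refill rate. This is a sketch: the function name `decide` and the `max_wait_sec` parameter are assumptions, and it ignores other requests that may already be queued ahead.

```python
import math


def decide(tokens_available, refill_per_sec, cost, max_wait_sec):
    """Return (action, seconds): allow now, queue for a short wait, or reject."""
    if cost <= tokens_available:
        return ("allow", 0)
    # Time until the bucket has refilled enough to cover the shortfall.
    wait = (cost - tokens_available) / refill_per_sec
    if wait <= max_wait_sec:
        return ("queue", wait)
    # Whole seconds, suitable for a Retry-After header on the 429.
    return ("reject", math.ceil(wait))


# Background batch job: willing to wait up to 5 seconds.
print(decide(200, refill_per_sec=100, cost=500, max_wait_sec=5))    # ('queue', 3.0)
print(decide(200, refill_per_sec=100, cost=5_000, max_wait_sec=5))  # ('reject', 48)
# Interactive chat: max_wait_sec=0 turns every shortfall into a 429.
print(decide(200, refill_per_sec=100, cost=500, max_wait_sec=0))    # ('reject', 3)
```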

429 handling on the client side

The client behavior matters as much as server behavior. Best practices:

Respect Retry-After if the server sends it. Do not retry sooner.

Exponential backoff with jitter if no Retry-After — delay = base * 2^attempt + random(0, jitter). Jitter is critical to avoid synchronized retries that immediately re-spike the limiter.

Cap retries — typically 3–5. After that, surface the error.

Idempotency keys — for any request that mutates state, send an idempotency key. The server should recognize it on retry and not double-charge.

A common bug: a client retries after a timeout or dropped connection, but the original request did succeed (network blip on the response), so the action runs twice. Idempotency keys are the fix.
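The client-side practices above fit in a small retry wrapper. In this sketch, `send` is a hypothetical callable that performs the request with the given idempotency key and returns `(status, retry_after_seconds_or_None)`; it assumes `Retry-After` arrives as a number of seconds rather than an HTTP date.

```python
import random
import time
import uuid


def backoff_delay(attempt, base=0.5, jitter=0.5, retry_after=None):
    """Honor Retry-After when present, else exponential backoff plus jitter."""
    if retry_after is not None:
        return float(retry_after)
    return base * (2 ** attempt) + random.uniform(0, jitter)


def call_with_retries(send, max_retries=4, sleep=time.sleep):
    """Retry on 429 up to `max_retries` times, reusing one idempotency key."""
    idempotency_key = str(uuid.uuid4())  # same key on every attempt
    for attempt in range(max_retries + 1):
        status, retry_after = send(idempotency_key)
        if status != 429:
            return status
        if attempt < max_retries:
            sleep(backoff_delay(attempt, retry_after=retry_after))
    raise RuntimeError("still rate limited after retries")  # surface the error
```

Reusing one key across attempts is what lets the server recognize a retry of a request it already processed.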

Observability for rate limits

Log per request:

Whether it was rate-limited (true / false)

Which limit was hit (TPM, RPM, per-tenant, global)

Tenant, plan, current bucket fill

Set alerts on:

Rate-limit error rate > 1% (something is misconfigured or a tenant is misbehaving)

One tenant consuming > 80% of global capacity (capacity planning signal)

Sudden spike in 429s (incident or attack)
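As a rough illustration of the log fields and the first two alert thresholds above, consider the sketch below. The field names, the function names, and the idea of evaluating alerts over a batch of counters are assumptions; spike detection is omitted because it needs a baseline that the text does not specify.

```python
import json


def rate_limit_log(tenant_id, plan, limited, limit_hit, bucket_fill):
    """Build one structured log line per request."""
    return json.dumps({
        "tenant_id": tenant_id,
        "plan": plan,
        "limited": limited,          # True if the request got a 429
        "limit_hit": limit_hit,      # e.g. "TPM", "RPM", "global", or None
        "bucket_fill": bucket_fill,  # fraction of the bucket still available
    }, sort_keys=True)


def find_alerts(total_requests, limited_requests, tenant_tokens, global_capacity):
    """Evaluate the error-rate and single-tenant-share thresholds."""
    fired = []
    if total_requests and limited_requests / total_requests > 0.01:
        fired.append("rate_limit_error_rate")
    for tenant, used in tenant_tokens.items():
        if used / global_capacity > 0.80:
            fired.append(f"tenant_dominates:{tenant}")
    return fired


print(rate_limit_log("t1", "pro", True, "TPM", 0.02))
print(find_alerts(1000, 20, {"a": 900, "b": 50}, global_capacity=1000))
```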

Implementation at scale

State is the problem. A single-process token bucket is trivial; coordinating across 50 server instances is not. Common patterns (gravitee on scale):

Redis with Lua (or Redis Cell extension) — atomic bucket updates from any instance

Consistent hashing — route a tenant's requests to the same instance, which holds local state

Sliding window counter — clock-tolerant, low memory, works with Redis

For most teams, Redis with token-bucket Lua scripts is the default. A single Redis instance can typically handle tens of thousands of bucket checks per second, and the setup is well understood operationally.
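A minimal sketch of the Redis-plus-Lua pattern follows. It assumes a redis-py style client (its `register_script` method returns a callable that takes `keys` and `args`), stores bucket state in a hash, and trusts the caller's clock for `now`; the script and the `make_limiter` helper are illustrative, not a vetted production implementation.

```python
import time

# Runs atomically inside Redis: refill, check, deduct, and persist in one step.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity  -- a missing key means a full bucket
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_per_sec)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Idle buckets expire once they would have refilled completely anyway.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_per_sec) * 2)
return allowed
"""


def make_limiter(client, clock=time.time):
    """Return allow(tenant_id, capacity, refill_per_sec, cost) -> bool."""
    script = client.register_script(TOKEN_BUCKET_LUA)

    def allow(tenant_id, capacity, refill_per_sec, cost):
        key = f"rate_limit:{tenant_id}:tpm"
        result = script(keys=[key], args=[capacity, refill_per_sec, cost, clock()])
        return result == 1

    return allow
```

With redis-py installed, `make_limiter(redis.Redis())` would be the typical wiring. Because every instance calls the same script against the same key, the check-and-deduct step cannot interleave between servers. Clock skew between instances is a known weakness of passing `now` from the client.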

Bottom line

Rate limiting AI APIs in 2026 is token-aware, per-tenant, and 429-friendly. Use token bucket on tokens (not requests), tier by plan, charge inputs upfront and reconcile outputs, and design clients to back off with jitter and idempotency keys. The result: noisy tenants do not starve quiet ones, the cost ceiling holds, and the system degrades gracefully under load instead of falling over.

Frequently Asked Questions

Should I rate limit on requests or tokens?

Tokens, almost always. AI requests can vary 100× or more in cost; per-request limits either over-provision (wasting capacity) or under-protect (a single huge request blows your budget). Token-based limits track actual cost.

How do I limit per tenant when I don't know cost until the response?

Reserve max_tokens upfront, refund the difference after. Or charge by input upfront and debit output as the response arrives. Both are practical; the first is more conservative.

What should clients do on a 429?

Respect Retry-After if present; otherwise exponential backoff with jitter. Cap retries at 3–5. Always send an idempotency key on mutating requests to safely retry without double-execution.