Rate Limiting AI APIs: Strategies That Actually Work

CallMissed

Rate limiting an AI API is harder than rate limiting a regular API. A "request" can cost $0.0001 or $5.00 depending on prompt size, model, and output length. A noisy tenant can starve a paying tenant. An agent loop can fire 100 model calls per user action. The "100 requests per minute" rules from REST-API days do not survive contact with this workload.

Why traditional rate limiting fails for AI

A single AI agent request can cost 100× more than a typical human request, yet traditional rate limiters treat them all the same. (Zuplo on AI rate limiting)

The provider rate-limit metrics that matter in 2026:

  • RPM — requests per minute
  • TPM — tokens per minute (input + output)
  • TPD — tokens per day
  • IPM — images per minute (for multimodal)
OpenAI, Anthropic, and most other providers publish limits in some combination of these units. TPM is usually the binding constraint, not RPM, because token cost dwarfs per-request cost.

    The four core algorithms

    Token bucket

    Each client has a "bucket" that fills at a fixed rate. Each request consumes tokens. Empty bucket → 429. The classic implementation; allows bursts up to the bucket size.

    Code
    bucket_size = 1000   # max burst, in bucket tokens
    refill_rate = 100    # bucket tokens added per second (steady state)

    Token bucket "often strikes the best balance, handling bursts effectively while maintaining overall traffic control." (api7.ai)

    For AI APIs, the tokens in the algorithm should be input/output tokens, not request count. A 50K-token prompt consumes 50K tokens from the bucket; a 500-token prompt consumes 500.
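A minimal in-memory sketch of a token bucket that is charged in LLM tokens rather than requests. The class name `TokenBucket`, the `try_consume` method, and the capacity numbers are illustrative assumptions, not taken from any provider's implementation; the `now` parameter exists only so the example can be run deterministically.

```python
import time


class TokenBucket:
    """Token bucket whose units are LLM tokens rather than requests."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.level = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = time.monotonic() if now is None else now

    def _refill(self, now):
        # Add tokens for the time elapsed since the last update, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_sec)
        self.updated_at = now

    def try_consume(self, cost, now=None):
        """Deduct `cost` tokens if available; False means the caller should send 429."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        if cost > self.level:
            return False
        self.level -= cost
        return True


# Hypothetical limits: 60K-token burst, 1K tokens/sec steady state.
bucket = TokenBucket(capacity=60_000, refill_per_sec=1_000, now=0.0)
big = bucket.try_consume(50_000, now=0.0)      # 50K-token prompt is admitted
small = bucket.try_consume(500, now=0.0)       # 500-token prompt is admitted
blocked = bucket.try_consume(50_000, now=1.0)  # bucket has not refilled enough
```

The large prompt drains most of the bucket in one call, which is the behavior a request-count bucket cannot express.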

    Leaky bucket

    Requests enter a queue at any rate; the queue drains at a fixed rate. Smooths bursts; can introduce latency for queued requests.
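The queueing behavior can be sketched by computing, for each arriving request, the time at which it would be released. This is a simplified model under stated assumptions: `LeakyBucketQueue`, `schedule`, and the `max_queue` bound are hypothetical names, and a real implementation would hold the request itself rather than only its release time.

```python
from collections import deque


class LeakyBucketQueue:
    """Leaky bucket as a queue: arrivals at any rate, releases at a fixed rate."""

    def __init__(self, drain_per_sec, max_queue):
        self.interval = 1.0 / drain_per_sec  # seconds between releases
        self.max_queue = max_queue           # requests allowed to wait at once
        self.next_free = 0.0                 # earliest time the next release may happen
        self.pending = deque()               # release times that are still in the future

    def schedule(self, now):
        """Return the release time for a request arriving at `now`, or None if full."""
        # Drop entries whose release time has already passed.
        while self.pending and self.pending[0] <= now:
            self.pending.popleft()
        if len(self.pending) >= self.max_queue:
            return None
        release_at = max(now, self.next_free)
        self.next_free = release_at + self.interval
        self.pending.append(release_at)
        return release_at


queue = LeakyBucketQueue(drain_per_sec=2, max_queue=3)
# Five requests arrive at the same instant; the drain rate spaces them 0.5 s apart.
release_times = [queue.schedule(0.0) for _ in range(5)]
```

The gap between arrival time and release time is the added latency the section mentions; the fifth request is refused because three are already waiting.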

    Fixed window

    Count requests in the current minute/hour. Simple but vulnerable to edge bursts: 100 requests at 12:00:59.5 plus 100 at 12:01:00.5 means 200 requests pass within about one second, even with a "100/min" limit.
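The edge burst is easy to reproduce with a small counter. The sketch below assumes a hypothetical `FixedWindowCounter` class and takes the timestamp as an argument so the scenario is deterministic.

```python
class FixedWindowCounter:
    """Counts requests per fixed window; the counter resets at each window boundary."""

    def __init__(self, limit, window_sec=60):
        self.limit = limit
        self.window_sec = window_sec
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window_sec)
        if window_id != self.window_id:
            # A new window has started, so the count starts over.
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowCounter(limit=100)
late = sum(limiter.allow(59.5) for _ in range(100))   # end of the first minute
early = sum(limiter.allow(60.5) for _ in range(100))  # start of the next minute
```

Both batches are admitted in full, so 200 requests pass roughly one second apart even though the limit is 100 per minute.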

    Sliding window

    Same idea as fixed window but with a rolling timer. Smoother behavior at window edges. The default for most modern systems.
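One common way to get a rolling window is a sliding window log, which keeps the timestamps of recent requests. The sketch below is one possible variant, not necessarily the one any particular system uses; `SlidingWindowLog` is a hypothetical name. It replays the edge-burst scenario from the fixed-window section.

```python
from collections import deque


class SlidingWindowLog:
    """Allows at most `limit` requests in any rolling `window_sec` interval."""

    def __init__(self, limit, window_sec=60.0):
        self.limit = limit
        self.window_sec = window_sec
        self.timestamps = deque()  # admission times, oldest first

    def allow(self, now):
        # Forget requests that have aged out of the rolling window.
        cutoff = now - self.window_sec
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


limiter = SlidingWindowLog(limit=100)
late = sum(limiter.allow(59.5) for _ in range(100))   # all admitted
early = sum(limiter.allow(60.5) for _ in range(100))  # all rejected: window still full
after_window = limiter.allow(119.5)                   # old entries have aged out
```

The log costs memory proportional to the limit, which is why counter-based approximations are often preferred for large limits.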

    Per-tenant fairness

    In multi-tenant systems, per-IP rate limiting is wrong. Two customers behind the same NAT share the limit; one tenant on multiple machines bypasses it.

    The right unit is per-tenant (or per-API-key, per-org, per-user — pick the granularity that matches billing). Each tenant gets their own bucket. (dreamfactory on multi-tenant)

    A practical implementation:

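    python
    # pseudocode: redis_token_bucket and tier_limit are placeholders
    tpm_limit = tier_limit(tenant.plan)       # tokens per minute for this plan
    bucket_key = f"rate_limit:{tenant_id}:tpm"
    allowed = redis_token_bucket(
        key=bucket_key,
        capacity=tpm_limit,
        refill_per_sec=tpm_limit / 60,        # spread the per-minute limit evenly
        cost=estimated_input_tokens,
    )
    if not allowed:
        return 429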

    Tier the limits by plan: free 10K TPM, starter 100K TPM, pro 1M TPM, enterprise custom.
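A runnable in-memory version of the same idea, as a sketch only. It assumes one process and a hypothetical `TenantLimiter` class; the plan limits are the ones listed above, and enterprise limits are passed in explicitly because they are custom.

```python
import time

# Tokens per minute per plan, from the tiers listed above.
PLAN_TPM = {"free": 10_000, "starter": 100_000, "pro": 1_000_000}


class TenantLimiter:
    """One token bucket per tenant, sized by the tenant's plan."""

    def __init__(self):
        self._buckets = {}  # tenant_id -> (level, last_update_time)

    def allow(self, tenant_id, plan, cost, now=None, custom_tpm=None):
        now = time.monotonic() if now is None else now
        capacity = custom_tpm if custom_tpm is not None else PLAN_TPM[plan]
        # A tenant seen for the first time starts with a full bucket.
        level, updated_at = self._buckets.get(tenant_id, (capacity, now))
        # Refill at capacity / 60 tokens per second, capped at capacity.
        level = min(capacity, level + (now - updated_at) * capacity / 60.0)
        if cost > level:
            self._buckets[tenant_id] = (level, now)
            return False
        self._buckets[tenant_id] = (level - cost, now)
        return True


limiter = TenantLimiter()
first = limiter.allow("acme", "free", cost=8_000, now=0.0)
second = limiter.allow("acme", "free", cost=8_000, now=0.0)    # acme is out of budget
other = limiter.allow("globex", "free", cost=8_000, now=0.0)   # globex is unaffected
later = limiter.allow("acme", "free", cost=8_000, now=60.0)    # acme has refilled
```

Because the key is the tenant and not the IP address, one tenant exhausting its budget has no effect on another.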

    Estimating cost before the request

    The unique problem with AI APIs: you do not know the exact cost until after the response. You know the input tokens, but output tokens are bounded only by max_tokens.

    Two approaches:

  • Pre-charge by max_tokens — reserve input + max_tokens from the bucket; refund the difference after the response. Conservative; wastes capacity on short responses.
  • Charge by input only, then debit output — charge input upfront, debit actual_output after. May briefly exceed limits if many requests have surprisingly long responses simultaneously.
    Most teams pick the second approach, paired with a smaller secondary check on cumulative TPM over the last 60 seconds. [Inference]
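The two accounting styles can be compared on a bare counter. This is a sketch under stated assumptions: `TokenBudget` and its method names are hypothetical, and refill logic is left out so that only the charging behavior is visible.

```python
class TokenBudget:
    """Bare token counter for one tenant; refill is omitted for brevity."""

    def __init__(self, capacity):
        self.available = capacity

    def reserve(self, input_tokens, max_tokens):
        """Approach 1: hold input + max_tokens. Returns the hold size, or None."""
        hold = input_tokens + max_tokens
        if hold > self.available:
            return None
        self.available -= hold
        return hold

    def settle(self, hold, input_tokens, output_tokens):
        """Approach 1: refund the part of the hold that was not actually used."""
        self.available += hold - (input_tokens + output_tokens)

    def charge_input(self, input_tokens):
        """Approach 2: charge only the known input cost upfront."""
        if input_tokens > self.available:
            return False
        self.available -= input_tokens
        return True

    def debit_output(self, output_tokens):
        """Approach 2: debit real output afterwards; the balance may go negative."""
        self.available -= output_tokens


# Approach 1: the second request is refused while the first hold is outstanding.
pre = TokenBudget(10_000)
hold = pre.reserve(input_tokens=2_000, max_tokens=4_000)
refused = pre.reserve(input_tokens=2_000, max_tokens=4_000)
pre.settle(hold, input_tokens=2_000, output_tokens=500)

# Approach 2: four requests are admitted, then one long response overshoots.
post = TokenBudget(10_000)
admitted = [post.charge_input(2_000) for _ in range(4)]
post.debit_output(3_000)
```

The refused request in the first half is the wasted capacity the conservative approach pays for; the negative balance in the second half is the brief overshoot that a secondary cumulative check is meant to bound.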

    Queue-based throttling

    When a tenant hits the limit, two responses are possible:

  • Reject with 429 + Retry-After header. The client backs off and retries.
  • Queue internally up to a max wait time, then either fulfill or 429. Smoother UX, but adds latency and operational complexity.
    For interactive UIs (a user typing in chat), reject is usually correct, because the user should know there is a problem. For background batch jobs, queueing is reasonable.
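The choice between the two responses can be reduced to one comparison: how long the bucket needs to refill versus how long the caller is willing to wait. The function below is a sketch; `throttle_decision` and its return shapes are hypothetical, and `Retry-After` is expressed in whole seconds.

```python
import math


def throttle_decision(deficit_tokens, refill_per_sec, max_wait_sec):
    """Decide whether to queue or reject a request the bucket cannot cover yet.

    deficit_tokens: tokens the request needs beyond what the bucket holds now.
    max_wait_sec:   0 for interactive traffic, larger for batch jobs.
    """
    wait_sec = deficit_tokens / refill_per_sec
    if wait_sec <= max_wait_sec:
        return ("queue", wait_sec)
    # Round up so the client does not come back before the bucket has refilled.
    headers = {"Retry-After": str(math.ceil(wait_sec))}
    return ("reject", {"status": 429, "headers": headers})


interactive = throttle_decision(deficit_tokens=450, refill_per_sec=100, max_wait_sec=0)
batch = throttle_decision(deficit_tokens=450, refill_per_sec=100, max_wait_sec=30)
```

Setting `max_wait_sec` per traffic class keeps a single code path for both interactive and batch callers.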

    429 handling on the client side

    The client behavior matters as much as server behavior. Best practices:

  • Respect Retry-After if the server sends it. Do not retry sooner.
  • Exponential backoff with jitter if there is no Retry-After: delay = base * 2^attempt + random(0, jitter). Jitter is critical to avoid synchronized retries that immediately re-spike the limiter.
  • Cap retries — typically 3–5. After that, surface the error.
  • Idempotency keys — for any request that mutates state, send an idempotency key. The server should recognize it on retry and not double-charge.
    A common bug: a client retries after a timeout or dropped connection, the original request did succeed (network blip on the response), and the action runs twice. Idempotency keys are the fix.
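The client-side practices above can be combined into one retry loop. This is a sketch, not a drop-in client: `send` is a hypothetical callable that takes an idempotency key and returns `(status, headers, body)`, the `base` and `jitter` defaults are arbitrary, and `Retry-After` is assumed to be in seconds rather than an HTTP date.

```python
import random
import time
import uuid


def backoff_delay(attempt, retry_after=None, base=0.5, jitter=0.5):
    """Seconds to wait before the next retry."""
    if retry_after is not None:
        return float(retry_after)  # the server's hint wins; never retry sooner
    return base * 2 ** attempt + random.uniform(0, jitter)


def call_with_retries(send, max_retries=4, sleep=time.sleep):
    """Call `send` until it stops returning 429 or the retry cap is reached."""
    # One key for all attempts, so the server can deduplicate a repeated request.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_retries + 1):
        status, headers, body = send(idempotency_key)
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return 429, None  # retries exhausted; surface the error to the caller
```

Passing `sleep` as a parameter keeps the loop testable without real delays; production callers can leave it at the default.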

    Observability for rate limits

    Log per request:

  • Whether it was rate-limited (true / false)
  • Which limit was hit (TPM, RPM, per-tenant, global)
  • Tenant, plan, current bucket fill
    Set alerts on:

  • Rate-limit error rate > 1% (something is misconfigured or a tenant is misbehaving)
  • One tenant consuming > 80% of global capacity (capacity planning signal)
  • Sudden spike in 429s (incident or attack)
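Two of these alert conditions can be evaluated directly from the per-request log records. The sketch below assumes a hypothetical record shape (a dict with `tenant`, `limited`, and `tokens` keys) and hypothetical alert names; spike detection is left out because it needs a baseline from earlier windows.

```python
def rate_limit_alerts(records, global_capacity_tokens):
    """Return alert names for one time window of per-request log records."""
    alerts = []

    # Alert 1: more than 1% of requests in the window were rate-limited.
    total = len(records)
    limited = sum(1 for record in records if record["limited"])
    if total and limited / total > 0.01:
        alerts.append("rate_limit_error_rate")

    # Alert 2: a single tenant used more than 80% of global capacity.
    usage = {}
    for record in records:
        usage[record["tenant"]] = usage.get(record["tenant"], 0) + record["tokens"]
    for tenant, tokens in usage.items():
        if tokens > 0.8 * global_capacity_tokens:
            alerts.append(f"tenant_dominates:{tenant}")

    return alerts


window = (
    [{"tenant": "a", "limited": False, "tokens": 900} for _ in range(98)]
    + [{"tenant": "b", "limited": True, "tokens": 0} for _ in range(2)]
)
alerts = rate_limit_alerts(window, global_capacity_tokens=100_000)
```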
    Implementation at scale

    State is the problem. A single-process token bucket is trivial; coordinating across 50 server instances is not. Common patterns (gravitee on scale):

  • Redis with Lua (or Redis Cell extension) — atomic bucket updates from any instance
  • Consistent hashing — route a tenant's requests to the same instance, which holds local state
  • Sliding window counter — clock-tolerant, low memory, works with Redis
    For most teams, Redis with token-bucket Lua scripts is the default. It scales to many tens of thousands of bucket updates per second and is well-understood operationally.
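For the consistent-hashing pattern, a small hash ring is enough to show how a tenant is pinned to one instance. This is a sketch under stated assumptions: `HashRing`, the instance names, and the choice of 64 virtual nodes per instance are all illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping tenant IDs to server instances."""

    def __init__(self, nodes, vnodes=64):
        # Each instance is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, tenant_id):
        # The owner is the first ring point clockwise from the tenant's hash.
        index = bisect.bisect(self._points, self._hash(tenant_id)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["api-1", "api-2", "api-3"])
owner = ring.node_for("tenant-42")  # the same instance on every call
```

When an instance is added, only the tenants that now hash to the new instance move, so most local bucket state survives a scale-out.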

    Bottom line

    Rate limiting AI APIs in 2026 is token-aware, per-tenant, and 429-friendly. Use token bucket on tokens (not requests), tier by plan, charge inputs upfront and reconcile outputs, and design clients to back off with jitter and idempotency keys. The result: noisy tenants do not starve quiet ones, the cost ceiling holds, and the system degrades gracefully under load instead of falling over.

    Frequently Asked Questions

    Should I rate limit on requests or tokens?
    Tokens, almost always. AI requests vary 100× in cost; per-request limits either over-provision (wastes capacity) or under-protect (a single huge request blows your budget). Token-based limits track actual cost.
    How do I limit per tenant when I don't know cost until the response?
    Reserve input plus max_tokens upfront and refund the difference afterwards. Or charge by input upfront and debit output as the response arrives. Both are practical; the first is more conservative.
    What should clients do on a 429?
    Respect Retry-After if present; otherwise exponential backoff with jitter. Cap retries at 3–5. Always send an idempotency key on mutating requests to safely retry without double-execution.
