Tool Use Design Patterns for AI Agents

CallMissed

The single biggest determinant of agent quality is not the model — it's the tools. A capable model with badly designed tools wanders, retries, hallucinates parameters, and burns tokens. A weaker model with well-shaped tools often outperforms it. Tool design has accumulated a stable set of patterns; here are the ones that actually move the needle in production.

Granularity: aim for the agent's mental unit, not the API endpoint

The most common anti-pattern is exposing your REST API one-to-one as tools. The agent now needs to call get_orders, get_order_items, and get_customer to answer "what did Anuj buy last week" — three round-trips, three opportunities to drop a parameter. A single search_orders(customer, since, status?) tool is a stronger primitive.

Rule of thumb: a tool should map to a meaningful agent action, not a low-level resource. If you find yourself documenting "first call X, then Y, then Z" in the system prompt, you have the wrong granularity.
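
As a rough illustration of that granularity, the sketch below folds the three lookups into one search_orders tool. The in-memory CUSTOMERS, ORDERS, and ORDER_ITEMS tables and every field name are hypothetical stand-ins for whatever the real API exposes; only the tool signature comes from the text above.

```python
from datetime import date
from typing import Optional

# Hypothetical in-memory data standing in for the three REST resources.
CUSTOMERS = {"c1": {"name": "Anuj"}}
ORDERS = [
    {"id": "o1", "customer_id": "c1", "placed_on": date(2026, 5, 4), "status": "closed"},
    {"id": "o2", "customer_id": "c1", "placed_on": date(2026, 3, 1), "status": "closed"},
]
ORDER_ITEMS = {"o1": ["keyboard", "mouse"], "o2": ["monitor"]}


def search_orders(customer: str, since: date, status: Optional[str] = None) -> list[dict]:
    """One agent-level tool that joins customer, orders, and items server-side."""
    wanted = customer.lower()
    customer_ids = {cid for cid, c in CUSTOMERS.items() if c["name"].lower() == wanted}
    results = []
    for order in ORDERS:
        if order["customer_id"] not in customer_ids:
            continue
        if order["placed_on"] < since:
            continue
        if status is not None and order["status"] != status:
            continue
        results.append({
            "order_id": order["id"],
            "placed_on": order["placed_on"].isoformat(),
            "status": order["status"],
            "items": ORDER_ITEMS.get(order["id"], []),
        })
    return results
```

With this shape the agent makes a single call such as search_orders("Anuj", date(2026, 5, 1)) and gets the items inline, so there is no call ordering to document in the system prompt.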

The opposite extreme is also a trap: a single mega-tool with 30 optional parameters lets the model pick the wrong combination silently. Aim for 5–20 tools, each doing one thing the model can describe in a sentence.

Idempotency: assume the agent will retry

Models retry. Frameworks retry. Networks retry. Treat every tool that mutates state as if it will be called twice for the same logical operation:

  • Accept an optional idempotency_key parameter
  • Return the same result for the same key
  • Document the behavior in the tool description so the model knows it's safe

Stripe popularized this for HTTP APIs and it transfers directly. The cost of an extra parameter is tiny; the cost of a duplicate charge or duplicate email is not.

Structured returns: never just stringify

Returning "User created: id=123" looks fine until the next agent step needs to use the ID and tries to parse it from the string. Always return structured content:

    {
      "ok": true,
      "user_id": 123,
      "email": "anuj@example.com",
      "created_at": "2026-05-09T12:00:00Z"
    }

In MCP and most function-calling protocols you can mix text and structured payloads. Use both: a one-line human-readable summary plus the structured data. The model uses the summary for chain-of-thought; downstream tools use the data.
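
One way to pair the two, sketched under assumptions: the "summary" and "data" envelope keys and the create_user_result helper are illustrative names, not the field names of MCP or any particular function-calling protocol.

```python
import json
from datetime import datetime, timezone


def create_user_result(user_id: int, email: str, created_at: datetime) -> dict:
    """Build a tool result carrying both a text summary and structured data."""
    data = {
        "ok": True,
        "user_id": user_id,
        "email": email,
        # Normalize to UTC and emit an ISO 8601 timestamp with a Z suffix.
        "created_at": created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return {
        "summary": f"Created user {user_id} ({email}).",  # for the model to read
        "data": data,                                      # for code and later tools
    }


result = create_user_result(
    123, "anuj@example.com", datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
)
print(json.dumps(result, indent=2))
```

The next step can read result["data"]["user_id"] directly instead of parsing the ID out of the summary sentence.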

Error shapes: distinguish input errors from system errors

A tool error is information the model can act on. The shape matters:

  • Input error ("invalid customer_id format") → model can reformulate and retry
  • Not found ("customer not in this tenant") → model should ask the user or pick a different path
  • System error ("database timeout") → model should back off; do not loop

Return a typed error code, not a stack trace. The classic mistake is to surface the upstream exception verbatim; the model treats "ConnectionError: HTTPConnectionPool(host=..." as content to reason about and produces nonsense.
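
The three categories above could be mapped to typed codes along these lines. The exception classes, the code strings, and the to_tool_error helper are all hypothetical names chosen for the sketch.

```python
from enum import Enum


class ErrorCode(str, Enum):
    INVALID_INPUT = "invalid_input"  # model can fix its arguments and retry
    NOT_FOUND = "not_found"          # model should ask the user or change path
    SYSTEM_ERROR = "system_error"    # model should back off, not loop


class ToolInputError(ValueError):
    """Raised by tool code when the arguments are malformed."""


class ToolNotFoundError(LookupError):
    """Raised by tool code when the target does not exist in scope."""


def to_tool_error(exc: Exception) -> dict:
    """Translate an exception into a typed error payload for the model."""
    if isinstance(exc, ToolInputError):
        return {"ok": False, "code": ErrorCode.INVALID_INPUT.value, "message": str(exc)}
    if isinstance(exc, ToolNotFoundError):
        return {"ok": False, "code": ErrorCode.NOT_FOUND.value, "message": str(exc)}
    # Anything else is a system error: log the details elsewhere and
    # return a fixed message so raw exception text never reaches the model.
    return {
        "ok": False,
        "code": ErrorCode.SYSTEM_ERROR.value,
        "message": "The service is temporarily unavailable.",
    }
```

Messages for the first two categories are written by the tool author, so they are safe to pass through; only unexpected exceptions get the fixed message.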

Scope permissions per call

A delete_user tool that can target any user in any tenant is a security incident waiting to happen. Permission scoping should sit at the tool layer, not the prompt:

  • Pass tenant / user context implicitly (from the agent's auth, not the model's arguments)
  • Validate at call time, not in the description
  • Return 403-style errors that the model can show but not bypass

The general rule: never trust the model to enforce a permission. The system prompt is not a security boundary.
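
A sketch of call-time scoping, assuming a hypothetical CallContext that the agent runtime builds from the authenticated session. The model only ever supplies target_user_id; the USERS table and the can_delete_users flag are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallContext:
    """Built by the runtime from the session's auth, never from model arguments."""
    tenant_id: str
    user_id: str
    can_delete_users: bool


# Hypothetical user table keyed by user ID.
USERS = {"u1": {"tenant_id": "t1"}, "u2": {"tenant_id": "t2"}}


def _forbidden(message: str) -> dict:
    return {"ok": False, "code": "forbidden", "message": message}


def delete_user(ctx: CallContext, target_user_id: str) -> dict:
    """Delete a user, enforcing tenant and role checks at call time."""
    if not ctx.can_delete_users:
        return _forbidden("This session is not allowed to delete users.")
    user = USERS.get(target_user_id)
    if user is None:
        return {"ok": False, "code": "not_found", "message": "No such user."}
    if user["tenant_id"] != ctx.tenant_id:
        return _forbidden("The target user belongs to a different tenant.")
    del USERS[target_user_id]
    return {"ok": True, "deleted_user_id": target_user_id}
```

Because tenant_id never appears in the tool's schema, no argument the model produces can widen the scope.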

Tool descriptions are prompts

Every tool description is part of your system prompt. Treat them with the same rigor:

  • One-sentence purpose at the top
  • A note on when not to use the tool
  • Each parameter described, including units and accepted formats
  • Example call (and example response, when it clarifies)

The Anthropic and OpenAI tool-use guides both emphasize this. Generic descriptions ("Search the database") produce generic calls; specific descriptions ("Search recent orders by customer email or ID. Returns up to 50 most recent. Use this when the user references a past purchase.") produce focused calls.
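
Put together, a definition following the checklist might look like the sketch below. The top-level keys are an assumption: providers differ on what they call the schema field, so treat "parameters" as a placeholder. The "when not to use" sentence and the status values are illustrative additions.

```python
import json

SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": (
        # Purpose first, then when to use it and when not to, then an example call.
        "Search recent orders by customer email or ID. Returns up to 50 most recent. "
        "Use this when the user references a past purchase. "
        "Do not use it to browse the product catalog. "
        'Example: {"customer": "anuj@example.com", "since": "2026-05-01"}'
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer": {
                "type": "string",
                "description": "Customer email address or customer ID.",
            },
            "since": {
                "type": "string",
                "format": "date",
                "description": "Earliest order date, ISO 8601 (YYYY-MM-DD).",
            },
            "status": {
                "type": "string",
                "enum": ["open", "closed", "archived"],
                "description": "Optional status filter.",
            },
        },
        "required": ["customer", "since"],
        "additionalProperties": False,
    },
}

print(json.dumps(SEARCH_ORDERS_TOOL, indent=2))
```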

Latency budgets per tool

Every tool call adds to the user's perceived latency. Two practical rules:

  • Set a per-tool timeout (e.g., 5s for reads, 15s for writes) and return a typed timeout error
  • Stream long-running tools. MCP supports progress notifications; the OpenAI Responses API supports tool streaming; both let the model show "still working..." without hanging the conversation

Aim to keep p95 tool latency under 1 second for hot-path agents. Slow tools should be moved off-path (background jobs, async polling) rather than blocking the loop.
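
The per-tool timeout rule can be sketched with asyncio.wait_for from the standard library. The call_with_timeout wrapper, the "timeout" code string, and the slow_lookup tool are hypothetical.

```python
import asyncio


async def call_with_timeout(tool, timeout_s: float, **kwargs) -> dict:
    """Run an async tool under a time budget and return a typed timeout error."""
    try:
        return await asyncio.wait_for(tool(**kwargs), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising.
        return {
            "ok": False,
            "code": "timeout",
            "message": f"Tool exceeded its {timeout_s}s budget.",
        }


async def slow_lookup(order_id: str) -> dict:
    """Stand-in for a tool that takes longer than a tight budget allows."""
    await asyncio.sleep(0.2)
    return {"ok": True, "order_id": order_id}
```

For example, asyncio.run(call_with_timeout(slow_lookup, 0.05, order_id="o1")) returns the typed timeout payload rather than raising, which gives the model something concrete to react to.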

A short anti-pattern catalog

  • String-typed enums. "Pass status as one of: open, closed, archived" — but the model passes "OPEN" or "open " or "active". Use a real enum schema; let validation reject and surface the canonical list.
  • Tools that need the model to remember state. Stateful tools that depend on a prior call's hidden output break under retries. Make every call self-contained.
  • Read tools that mutate. "search" should not log a search history record that's visible to other tools. Surprises kill agent reliability.
  • Tools the model can't tell apart. Two tools with overlapping descriptions cause coin-flip routing. Keep names and descriptions disjoint.

Frequently Asked Questions

How many tools is too many for one agent?
Most published guidance appears to land around 20 as a soft ceiling per agent (an inference, not a hard limit). Past that, routing accuracy degrades and you're better served by splitting into multiple agents with handoffs.

Should every tool return JSON or is text fine?
Mix both. Return a short human-readable summary plus a structured payload. The model uses the summary for reasoning; downstream tools and your code use the structured data. Pure-text returns force re-parsing and lose precision.

How do I stop the model from hallucinating tool parameters?
Use strict JSON Schemas with explicit enums and constraints, validate aggressively, and return a clear typed error listing the accepted values. Verbose tool descriptions with examples also reduce the failure rate.
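
A small sketch of that last answer, using the status values from the anti-pattern catalog. The Status enum, the parse_status helper, and the error payload shape are illustrative; the validation deliberately rejects near-misses instead of normalizing them.

```python
from enum import Enum


class Status(str, Enum):
    OPEN = "open"
    CLOSED = "closed"
    ARCHIVED = "archived"


def parse_status(raw: str) -> dict:
    """Validate a status argument strictly and list accepted values on failure."""
    try:
        return {"ok": True, "status": Status(raw).value}
    except ValueError:
        accepted = [s.value for s in Status]
        return {
            "ok": False,
            "code": "invalid_input",
            "message": f"status must be one of {accepted}; got {raw!r}",
            "accepted_values": accepted,
        }
```

Inputs like "OPEN", "open " (trailing space), or "active" all come back as invalid_input with the canonical list attached, so the model can correct the call on its next attempt.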
