Agent Observability: Tracing Tool Calls End-to-End

CallMissed

You will not debug an agent from logs. The reasoning chain is too branched, the latency surface too rich, and the failure modes too non-local. What you need is a trace — a tree-structured record of every LLM call, tool invocation, retrieval, and decision boundary, with timing and content attached. The good news is that 2026's tooling, both open-source and vendor, has converged on a clean enough standard that this is no longer a special-case build.

What a good agent trace looks like

A useful agent trace is a tree, not a list:

  • Root span: the user request
  • Child spans: each LLM call, each tool invocation, each retrieval, each agent boundary in a multi-agent system
  • Attributes per span: model name, input tokens, output tokens, latency, cost, prompt version
  • Events per span: the prompt content, the response content, intermediate tool arguments

You should be able to load any production trace and, without grepping any logs, answer three questions: which step took the time, which step burned the tokens, and which step produced the wrong intermediate output?
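To make the tree shape concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK: the `Span` class, its fields, and the sample span names and numbers are all hypothetical, chosen only to show how a tree-structured trace lets you answer "which step took the time?" and "which step burned the tokens?" with a simple walk.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a trace tree (hypothetical model, not the OTel SDK)."""
    name: str
    kind: str                 # "request", "llm", "tool", or "retrieval"
    self_ms: float            # time spent in this span, excluding children
    input_tokens: int = 0
    output_tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def walk(self):
        """Yield this span and every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


def slowest_step(root: Span) -> Span:
    """Span with the largest self time anywhere in the tree."""
    return max(root.walk(), key=lambda s: s.self_ms)


def hungriest_step(root: Span) -> Span:
    """Span that consumed the most tokens (input plus output)."""
    return max(root.walk(), key=lambda s: s.input_tokens + s.output_tokens)


# Made-up trace: one user request with four child steps.
trace = Span("user_request", "request", 5.0, children=[
    Span("plan", "llm", 820.0, input_tokens=1200, output_tokens=150),
    Span("search_docs", "retrieval", 140.0),
    Span("call_weather_api", "tool", 2300.0),
    Span("answer", "llm", 950.0, input_tokens=3400, output_tokens=420),
])

print(slowest_step(trace).name)    # call_weather_api
print(hungriest_step(trace).name)  # answer
```

A flat log would hold the same numbers, but the parent/child links are what let a backend render the waterfall and attribute time to the right step.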

OpenTelemetry GenAI as the standard

The OpenTelemetry GenAI semantic conventions are the convergence point in 2026. They cover four areas:

  • LLM client spans — every model call, with gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
  • Agent spans — agent boundaries with gen_ai.agent.name, gen_ai.agent.id, gen_ai.operation.name
  • Events — prompt and completion content, kept off span attributes for size and PII reasons
  • Metrics — aggregate counters and histograms for usage, errors, and latency

Major vendors including Datadog, Honeycomb, and New Relic already support these conventions, and the major frameworks (LangChain, CrewAI, AutoGen) emit OTel-compliant spans natively or through instrumentation packages.
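As a rough illustration of the naming, the sketch below builds attribute dictionaries using only the keys listed above. The helper names and all sample values (`"openai"`, `"example-model"`, `"planner"`, `"agent-001"`, `"invoke_agent"`) are placeholders for illustration, and the code deliberately uses plain dictionaries rather than any SDK call.

```python
GENAI_PREFIX = "gen_ai."


def llm_client_attributes(system, request_model, input_tokens, output_tokens):
    """Attributes for one model call, using the attribute names listed above."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def agent_attributes(agent_name, agent_id, operation_name):
    """Attributes for one agent boundary."""
    return {
        "gen_ai.agent.name": agent_name,
        "gen_ai.agent.id": agent_id,
        "gen_ai.operation.name": operation_name,
    }


def keys_outside_genai(attributes):
    """Return keys that are not in the gen_ai.* namespace.

    Not every such key is wrong (other OTel namespaces exist); this is only a
    quick way to spot ad-hoc names like 'model' that a backend won't recognize.
    """
    return sorted(k for k in attributes if not k.startswith(GENAI_PREFIX))


# Placeholder values, for illustration only.
llm = llm_client_attributes("openai", "example-model", 1200, 150)
agent = agent_attributes("planner", "agent-001", "invoke_agent")

print(keys_outside_genai({**llm, "model": "example-model"}))  # ['model']
```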

Where to put prompt content

This is the trap most teams hit. The instinct is to put the full prompt and completion as span attributes. The problem: attributes are typically indexed, are subject to size limits, and expose PII directly to your observability backend.

The GenAI conventions answer: use span events for prompt and completion content. Events can be sampled, redacted, or dropped at the OpenTelemetry Collector level without touching application code. Sensitive fields stay off the indexed attribute set.

Practical setup:

  • Span attributes for sizes and metadata (token counts, model, latency, cost)
  • Span events for the actual content (gen_ai.user.message, gen_ai.assistant.message, gen_ai.tool.message)
  • Collector-level redaction rules that strip PII before the trace ships to your backend
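The split between attributes and events can be sketched as follows. This is a plain-Python model, not Collector configuration: in a real pipeline the redaction step would run in the OpenTelemetry Collector, whereas here a hypothetical `redact_events` function stands in for it. The span layout, the `latency_ms` key, the email-only regex, and the sample text are all assumptions.

```python
import re

# Deliberately simple pattern: catches plain email addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def build_span(model, prompt, completion, input_tokens, output_tokens, latency_ms):
    """Metadata goes in attributes; actual content goes in events."""
    return {
        "attributes": {
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            "latency_ms": latency_ms,  # illustrative key, not a convention name
        },
        "events": [
            {"name": "gen_ai.user.message", "body": prompt},
            {"name": "gen_ai.assistant.message", "body": completion},
        ],
    }


def redact_events(span):
    """Return a copy with event bodies redacted; attributes are left untouched."""
    return {
        "attributes": dict(span["attributes"]),
        "events": [
            {**event, "body": EMAIL.sub("[REDACTED_EMAIL]", event["body"])}
            for event in span["events"]
        ],
    }


span = build_span(
    "example-model",
    "Email jane.doe@example.com about the refund",
    "Drafted a reply to the customer.",
    input_tokens=42, output_tokens=9, latency_ms=310.0,
)
shipped = redact_events(span)
print(shipped["events"][0]["body"])  # Email [REDACTED_EMAIL] about the refund
```

Because the content lives only in events, the redaction step never has to inspect or rewrite the attribute set that the backend indexes.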

Backends in 2026

The serious options:

Open-source / self-hostable

  • Langfuse — open-source under MIT, self-hostable on ClickHouse. Good for teams that want full data ownership and existing OTel pipelines.
  • Phoenix (Arize) — open-source, focused on LLM tracing and evals. Strong eval integration.
  • Helicone — open-source proxy plus hosted backend. Lower setup tax than Langfuse; less self-host story.
  • Jaeger v2 — rebuilt on the OpenTelemetry Collector, so it ingests OTel traces natively; general-purpose tracing without LLM-specific evaluation features.

Hosted / closed-source

  • Datadog LLM Observability — natively supports GenAI semantic conventions. Strong fit when you already have a Datadog account.
  • Braintrust — observability tightly bundled with evals and PR-level CI gating
  • LangSmith — best in class for LangChain / LangGraph stacks; tighter coupling than vendor-neutral options

Native to your model provider

OpenAI, Anthropic, and Google each ship dashboards for their own API usage. These are useful for billing and quick checks but rarely sufficient as your primary observability: you can't see your own tools, retrievals, or agent boundaries from there.

What to instrument first

If you're starting from zero, instrument in this order:

  • Every LLM call. Provider, model, input tokens, output tokens, latency, cost.
  • Every tool call. Tool name, arguments (redacted if sensitive), latency, success / error.
  • Every retrieval. Index, query, result count, top-k IDs.
  • Agent boundaries. Which agent owned each turn; for handoffs, who handed off to whom.
  • User-facing latency. First-token-time, last-token-time, end-to-end wall clock.

This is enough to answer roughly 80% of the questions that come up in production debugging.
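For the tool-call item in that list, a minimal sketch of what "tool name, arguments (redacted if sensitive), latency, success / error" could look like in code. Everything here is hypothetical: the `traced_tool` decorator, the `RECORDED` list standing in for an exporter, the `SENSITIVE_KEYS` set, and the `lookup_order` tool. A real setup would open a span through your tracing SDK instead of appending to a list.

```python
import functools
import time

RECORDED = []                           # stand-in for a span exporter
SENSITIVE_KEYS = {"api_key", "email"}   # hypothetical list of argument names to mask


def traced_tool(fn):
    """Record name, masked keyword arguments, latency, and outcome per call."""
    @functools.wraps(fn)
    def wrapper(**kwargs):
        record = {
            "tool_name": fn.__name__,
            "arguments": {
                k: "[REDACTED]" if k in SENSITIVE_KEYS else v
                for k, v in kwargs.items()
            },
            "status": "ok",
        }
        start = time.perf_counter()
        try:
            return fn(**kwargs)
        except Exception as exc:
            record["status"] = "error"
            record["error_type"] = type(exc).__name__
            raise  # the caller still sees the failure
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            RECORDED.append(record)
    return wrapper


@traced_tool
def lookup_order(*, order_id, email):
    """Hypothetical tool used only to exercise the decorator."""
    if not order_id:
        raise ValueError("order_id is required")
    return {"order_id": order_id, "status": "shipped"}


lookup_order(order_id="A-1001", email="jane.doe@example.com")
try:
    lookup_order(order_id="", email="jane.doe@example.com")
except ValueError:
    pass

for rec in RECORDED:
    print(rec["tool_name"], rec["status"], rec["arguments"])
```

The failed call is recorded with `status` set to `"error"` before the exception propagates, which is the case the "no span for tool calls" anti-pattern leaves invisible.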

Sampling and cost

Tracing every span at full fidelity in production is expensive. Two patterns:

  • Tail-based sampling. Keep all error traces, slow traces, and a small percentage of success traces. The OTel Collector supports this through its tail sampling processor, which ships in the contrib distribution.
  • Content sampling. Always sample span structure; sample prompt/completion events at lower rates (1–5%). You keep the latency picture; you don't keep every token.

For evals and replay, you may want richer retention on a tagged subset (e.g., all traces from a specific feature flag or staging environment).
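The two sampling decisions can be sketched as plain functions. This is an illustration of the logic only, not Collector configuration; the function names, the 5-second slow threshold, the 5% success rate, and the 2% content rate are assumptions picked for the example.

```python
import random


def keep_trace(has_error, duration_ms, *, slow_ms=5000.0,
               success_rate=0.05, rng=random.random):
    """Tail-based decision, made after the whole trace has finished.

    Errors and slow traces are always kept; the rest are kept at success_rate.
    """
    if has_error or duration_ms >= slow_ms:
        return True
    return rng() < success_rate


def keep_content(*, content_rate=0.02, rng=random.random):
    """Content decision for a kept trace: retain prompt/completion events?

    Span structure is always kept; only the content events are sampled.
    """
    return rng() < content_rate


# Passing a fixed rng makes the decisions deterministic for demonstration.
print(keep_trace(True, 120.0, rng=lambda: 0.99))    # True  (error)
print(keep_trace(False, 6200.0, rng=lambda: 0.99))  # True  (slow)
print(keep_trace(False, 120.0, rng=lambda: 0.99))   # False (fast success, not sampled)
print(keep_content(rng=lambda: 0.01))               # True  (falls inside the 2%)
```

The `rng` parameter exists so the decision can be tested deterministically; in production you would leave it at its default.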

Anti-patterns

  • No span for tool calls. You see the LLM call but not what the tool actually did. Useless when the tool is the failure source.
  • Logs instead of traces. A flat log file can't show parallel branches or call hierarchies. Print logs alongside traces, not instead of them.
  • Vendor lock by SDK. Picking a vendor SDK that emits proprietary span shapes makes migration painful. Emit OTel; route to your backend of choice.

Frequently Asked Questions

Do I need OpenTelemetry if my framework already has tracing?
Yes. Framework-native tracing (LangSmith for LangChain, for example) works well for its own stack but ties you to one vendor. Emitting OTel-compliant spans at the framework boundary lets you pipe to multiple backends and migrate later without re-instrumenting.
What's the cheapest path to agent observability?
Self-hosted Langfuse or Phoenix on a small VM. Both are free to self-host and run on commodity infrastructure, though their licenses differ, so check the terms before you deploy. The labor cost (operating and scaling the datastore, such as ClickHouse for Langfuse) is the real expense, not licenses.
How do I avoid leaking PII through traces?
Put prompt and completion content in span events (not attributes), redact at the OpenTelemetry Collector before shipping to your backend, and sample content events at a lower rate than structural spans. Treat traces with the same data-handling rigor as production logs.
