You will not debug an agent from logs. The reasoning chain is too branched, the latency surface too rich, and the failure modes too non-local. What you need is a trace — a tree-structured record of every LLM call, tool invocation, retrieval, and decision boundary, with timing and content attached. The good news is that 2026's tooling, both open-source and vendor, has converged on a clean enough standard that this is no longer a special-case build.
What a good agent trace looks like
A useful agent trace is a tree, not a list:
Root span: the user request
Child spans: each LLM call, each tool invocation, each retrieval, each agent boundary in a multi-agent system
Attributes per span: model name, input tokens, output tokens, latency, cost, prompt version
Events per span: the prompt content, the response content, intermediate tool arguments
You should be able to load any production trace and answer three questions without grepping any logs: which step took the time, which step burned the tokens, and which step produced the wrong intermediate output?
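As a minimal sketch of that tree shape, the Python below models spans with plain dataclasses and answers the three questions by walking the tree. The `Span` class, its field names, and the `wrong_output` flag are illustrative assumptions, not an OpenTelemetry API; a real backend would run equivalent queries over exported spans.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    """Hypothetical in-memory span; field names are illustrative only."""
    name: str
    latency_ms: float = 0.0  # time spent in this step itself, excluding children
    input_tokens: int = 0
    output_tokens: int = 0
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def slowest_step(root: Span) -> Span:
    """Which step took the time?"""
    return max(root.walk(), key=lambda s: s.latency_ms)


def hungriest_step(root: Span) -> Span:
    """Which step burned the tokens?"""
    return max(root.walk(), key=lambda s: s.input_tokens + s.output_tokens)


def first_flagged_step(root: Span) -> Optional[Span]:
    """Which step was marked (by a reviewer or an eval) as producing bad output?"""
    return next((s for s in root.walk() if s.attributes.get("wrong_output")), None)


# A made-up trace: one user request with two LLM calls and one tool call.
trace = Span("user_request", latency_ms=5, children=[
    Span("llm.plan", latency_ms=820, input_tokens=1200, output_tokens=90),
    Span("tool.search", latency_ms=2300, attributes={"wrong_output": True}),
    Span("llm.answer", latency_ms=1100, input_tokens=4100, output_tokens=350),
])

print(slowest_step(trace).name)
print(hungriest_step(trace).name)
print(first_flagged_step(trace).name)
```

The point of the sketch is that each question becomes a one-line query once the data is a tree with typed attributes; the same questions asked of a flat log would need parsing and correlation first.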
Where to put prompt and completion content is the trap most teams hit. The instinct is to attach the full prompt and completion as span attributes. The problem: attributes are typically indexed by the backend, are subject to size limits, and expose any PII they contain directly to your observability backend.
The OpenTelemetry GenAI semantic conventions answer: use span events for prompt and completion content. Events can be sampled, redacted, or dropped at the OpenTelemetry Collector level without touching application code. Sensitive fields stay off the indexed attribute set.
Practical setup:
Span attributes for sizes and metadata (token counts, model, latency, cost)
Span events for the actual content (gen_ai.user.message, gen_ai.assistant.message, gen_ai.tool.message)
Collector-level redaction rules that strip PII before the trace ships to your backend
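The split above can be sketched in a few lines of Python. This is a stand-in using plain dictionaries, not the OpenTelemetry SDK or a Collector processor: `build_span`, `redact`, and the email-only pattern are assumptions made for illustration, and real redaction rules would cover far more than email addresses. The attribute keys follow the GenAI convention naming as best understood here; check them against the version of the conventions you target.

```python
import re

# Deliberately narrow pattern for the example; real PII rules need more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def build_span(model, prompt, completion, input_tokens, output_tokens):
    """Sizes and metadata go in attributes; raw content goes in events."""
    return {
        "name": "llm.call",
        "attributes": {
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
        "events": [
            {"name": "gen_ai.user.message", "body": prompt},
            {"name": "gen_ai.assistant.message", "body": completion},
        ],
    }


def redact(span, drop_content=False):
    """Return a copy with emails masked in event bodies, or with events dropped.

    Attributes pass through untouched, which is only safe because no raw
    content was placed there in the first place.
    """
    if drop_content:
        events = []
    else:
        events = [
            {**event, "body": EMAIL.sub("[REDACTED_EMAIL]", event["body"])}
            for event in span["events"]
        ]
    return {**span, "events": events}


span = build_span(
    model="example-model",
    prompt="Email jane.doe@example.com the report",
    completion="Sent to jane.doe@example.com.",
    input_tokens=12,
    output_tokens=7,
)
clean = redact(span)
for event in clean["events"]:
    print(event["name"], "->", event["body"])
```

Because the content lives only in events, the redaction step never has to inspect attributes, and dropping content entirely (`drop_content=True`) leaves the token counts and model name intact for latency and cost analysis.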
Backends in 2026
The serious options:
Open-source / self-hostable
Langfuse — open-source under MIT, self-hostable on ClickHouse. Good for teams that want full data ownership and existing OTel pipelines.
Phoenix (Arize) — open-source, focused on LLM tracing and evals. Strong eval integration.
Helicone — open-source proxy plus hosted backend. Lower setup tax than Langfuse; less self-host story.
Jaeger v2 — rebuilt around the OpenTelemetry Collector and released in late 2024; general-purpose distributed tracing that can store agent traces but has no LLM-specific evaluation features.
Hosted / closed-source
Datadog LLM Observability — natively supports GenAI semantic conventions. Strong fit when you already have a Datadog account.
Braintrust — observability tightly bundled with evals and PR-level CI gating
LangSmith — best in class for LangChain / LangGraph stacks; tighter coupling than vendor-neutral options
Native to your model provider
OpenAI, Anthropic, and Google each ship dashboards for their own API usage. Useful for billing and quick checks; rarely sufficient as your primary observability — you can't see your own tools, retrievals, or agent boundaries from there.
What to instrument first
If you're starting from zero, instrument in this order:
This is enough to answer the 80% of questions that will come up in production debugging.
Sampling and cost
Tracing every span at full fidelity in production is expensive. Two patterns:
Tail-based sampling. Keep all error traces, slow traces, and a small percentage of success traces. The OTel Collector's contrib distribution provides a tail_sampling processor for this, so it needs configuration rather than custom code.
Content sampling. Always sample span structure; sample prompt/completion events at lower rates (1–5%). You keep the latency picture; you don't keep every token.
For evals and replay, you may want richer retention on a tagged subset (e.g., all traces from a specific feature flag or staging environment).
Anti-patterns
No span for tool calls. You see the LLM call but not what the tool actually did. Useless when the tool is the failure source.
Logs instead of traces. A flat log file can't show parallel branches or call hierarchies. Print logs alongside traces, not instead of them.
Vendor lock by SDK. Picking a vendor SDK that emits proprietary span shapes makes migration painful. Emit OTel; route to your backend of choice.
Frequently Asked Questions
Do I need OpenTelemetry if my framework already has tracing?
Yes. Framework-native tracing such as LangSmith works well within its own stack but ties you to that vendor. Emitting OTel-compliant spans at the framework boundary lets you pipe to multiple backends and migrate later without re-instrumenting.
What's the cheapest path to agent observability?
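Self-hosted Langfuse or Phoenix on a small VM. Both can be self-hosted without license fees (Langfuse's core is MIT-licensed; check Phoenix's license terms for your use case) and run on commodity infrastructure. The labor cost (operating and scaling the datastore, such as ClickHouse for Langfuse) is the real expense, not licenses.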
How do I avoid leaking PII through traces?
Put prompt and completion content in span events (not attributes), redact at the OpenTelemetry Collector before shipping to your backend, and sample content events at a lower rate than structural spans. Treat traces with the same data-handling rigor as production logs.