Evaluating Voice Agents: Beyond Word Error Rate
Word Error Rate (WER) is the most-quoted metric in voice AI and the least useful for evaluating actual voice agents. WER measures speech-to-text (STT) accuracy: how closely a transcript matches a reference. It tells you nothing about whether your agent answered the user's question, finished the task, sounded natural, or kept the conversation alive. In 2026, the teams that ship better voice agents are the ones that have moved beyond WER.
What WER actually measures
WER counts the word-level substitutions, insertions, and deletions that separate the transcript from a reference, divided by the number of words in the reference. It's a clean, automatable, decades-old metric. It's also a single-component measure of one stage in a multi-stage pipeline.
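For reference, the calculation itself is short. The sketch below is a minimal illustration, not a production scorer: it assumes whitespace tokenization and lowercasing as the only normalization (real scorers usually also normalize punctuation and numerals), and the function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

In a hypothetical booking exchange, `word_error_rate("book a table for two", "book a table for you")` comes out to 0.2: one substituted word out of five. The score is identical whether the wrong word was filler or the party size, which is exactly the information a task-level metric needs and WER does not carry.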
A voice agent can have:
What separates them is everything that happens after the transcript.
What actually matters: conversation success rate
The single highest-leverage metric for a voice agent is conversation success rate — the percentage of conversations that ended in the user's intent being served.
Definition matters. For an appointment-booking agent, success is "appointment scheduled correctly." For a tier-1 support agent, success is "user's question answered or correctly escalated." For a qualification agent, success is "qualified lead's data captured accurately."
Measuring this requires:
A good production target is 80%+ on simple use cases and 60%+ on complex ones. As an inference rather than a measured result: below those numbers, the user-experience pain outweighs the cost savings.
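Once conversations carry a success label, the rate and the target check are simple bookkeeping. The following is a rough sketch under stated assumptions: the `LabeledCall` record, the `"simple"` / `"complex"` use-case keys, and both function names are hypothetical, and the thresholds are just the targets quoted above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Targets quoted in the text; the keys are an assumed labeling scheme.
TARGETS = {"simple": 0.80, "complex": 0.60}


@dataclass
class LabeledCall:
    call_id: str
    use_case: str   # assumed to be "simple" or "complex"
    success: bool   # did the call end with the user's intent served?


def success_rates(calls):
    """Return the fraction of successful calls per use case."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for call in calls:
        totals[call.use_case] += 1
        successes[call.use_case] += int(call.success)
    return {use_case: successes[use_case] / totals[use_case] for use_case in totals}


def below_target(rates, targets=TARGETS):
    """List the use cases whose success rate falls under its target."""
    return sorted(uc for uc, rate in rates.items() if uc in targets and rate < targets[uc])
```

The hard part is not this arithmetic but the label itself: `success` has to encode the per-agent definition given earlier (appointment scheduled correctly, question answered or correctly escalated, and so on).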
Latency metrics that matter
Three latency numbers worth tracking:
Measure p50, p95, and p99. The tail matters in voice: one slow turn can ruin the whole call. Per Deepgram's latency guide, latency is dominated by what happens after audio capture, not the network alone.
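To make the percentile advice concrete, here is a small sketch that summarizes per-turn latencies. It assumes latencies are already collected in milliseconds and uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values); the helper names are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return p50, p95, and p99 for a list of per-turn latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With 98 turns at 400 ms and 2 turns at 3000 ms, `latency_summary` returns 400 for p50 and p95 but 3000 for p99. The median and even p95 look healthy while two turns were slow enough to derail their calls, which is why the tail needs its own number.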
Interruption metrics
Two specific numbers to track:
These are usually invisible to standard monitoring, so you have to instrument them explicitly. Auditing them means sampling real conversations and tagging them.
Customer-success eval
For business-critical agents, customer success rates per cohort matter more than per-call metrics. Track:
These are slow-moving metrics — measured over weeks, not minutes — but they're the ones that determine whether the agent stays deployed.
Regression testing for voice
When you change a prompt, swap a model, or upgrade a TTS, what breaks? Voice agents need regression tests, but the format is unusual:
A practical setup is 100 scenarios that exercise the main intents plus 30 known edge cases (interruptions, code-switching, repair, escalation triggers). Run them on every change. Compare success rates and latency to the previous version.
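The comparison step can be sketched as a diff between two runs of the same scenario suite. Everything named here is an assumption for illustration: the `ScenarioResult` record, the scenario IDs as dictionary keys, and the 200 ms per-scenario latency budget are not from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    passed: bool
    latency_ms: float


def compare_runs(baseline, candidate, latency_budget_ms=200.0):
    """Compare two {scenario_id: ScenarioResult} runs on the scenarios they share."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("runs have no scenarios in common")

    def success_rate(run):
        return sum(run[s].passed for s in shared) / len(shared)

    return {
        "baseline_success_rate": success_rate(baseline),
        "candidate_success_rate": success_rate(candidate),
        # Scenarios that passed before the change and fail after it.
        "newly_failing": [
            s for s in shared if baseline[s].passed and not candidate[s].passed
        ],
        # Scenarios that slowed down by more than the assumed budget.
        "slower": [
            s for s in shared
            if candidate[s].latency_ms - baseline[s].latency_ms > latency_budget_ms
        ],
    }
```

Listing the newly failing scenarios, rather than only the aggregate rate, matters because a change can fix one intent and break another while the overall success rate stays flat.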
What synthetic eval misses
Synthetic evaluation has limits:
Treat synthetic eval as a regression-prevention tool. Treat real-call sampling as the ground truth. The teams that ship best run both.
Who does the labeling
Three options for labeling conversations:
Calibrate the LLM judge against human judgment quarterly. Drift is real.
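One way to run that calibration, sketched under assumptions: take a sample of conversations labeled by both a human and the LLM judge, then compute raw agreement and Cohen's kappa, which corrects agreement for chance. The binary success/failure labels and the function name are choices made for this example, not a prescribed method.

```python
def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two lists of boolean labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)

    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: both say "success" or both say "failure" independently.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    if expected == 1:
        # Degenerate case: both raters gave the same single label everywhere.
        # Kappa is formally undefined here; this sketch reports 1.0.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

Recording both numbers at each calibration gives drift something to show up in: if agreement on a fresh sample falls relative to the previous quarter, the judge prompt or model needs another look before its labels are trusted.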
Vendors and platforms
A few platforms in 2026 specialize in voice agent evaluation: Coval, Hamming, and others have built around scenario replay, LLM-judged grading, and real-call sampling. Coval's 2026 conversation with Pipecat and Ultravox covers the state of voice eval at the practitioner level.
For early-stage teams, a homegrown setup with scripted scenarios and an LLM grader is enough. Adopt a platform when scenario count crosses a few hundred or when human raters are the bottleneck.
A pragmatic eval stack
For a 2026 voice agent in production:
Total time investment: ~10 hours/week of human attention. Total impact: the difference between "deployed and improving" and "deployed and stagnant."
The bottom line
WER tells you about one stage of one component. Conversation success rate, time to first audio (TTFA), interruption metrics, and customer-success outcomes tell you about the agent. The eval stack should match the actual product surface, not the slot the academic literature happens to provide. Build the metrics first; the agent improves itself once the metrics are honest.
