Evaluating Voice Agents: Beyond Word Error Rate

CallMissed

Word Error Rate (WER) is the most-quoted metric in voice AI and the least useful for evaluating actual voice agents. WER measures speech-to-text (STT) accuracy on transcribed audio. It tells you nothing about whether your agent answered the user's question, finished the task, sounded natural, or kept the conversation alive. In 2026, the teams that ship better voice agents are the ones that have moved beyond WER.

What WER actually measures

WER counts the substitutions, insertions, and deletions needed to turn the transcript into a reference, divided by the number of words in the reference. It's a clean, automatable, decades-old metric. It's also a measure of a single component at one stage of a multi-stage pipeline.
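
To make the definition concrete, here is a minimal sketch of WER as a word-level edit distance. The function name and the whitespace-and-lowercase normalization are assumptions for illustration; production scorers usually apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("book a table for two", "book the table for two"))  # 0.2
```

Because insertions count against a fixed reference length, WER can exceed 100%. Note also that the score says nothing about which word was wrong: a mangled filler word costs the same as a mangled appointment time.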

A voice agent can have:

  • 4% WER STT and a terrible user experience.
  • 15% WER STT and a great user experience.

What separates them is everything that happens after the transcript.

    What actually matters: conversation success rate

    The single highest-leverage metric for a voice agent is conversation success rate — the percentage of conversations that ended in the user's intent being served.

    Definition matters. For an appointment-booking agent, success is "appointment scheduled correctly." For a tier-1 support agent, success is "user's question answered or correctly escalated." For a qualification agent, success is "qualified lead's data captured accurately."

    Measuring this requires:

  • Defined success criteria per use case.
  • A way to label outcomes — automated where possible, human-rated where needed.
  • A consistent sampling strategy.

    A good production target is 80%+ on simple use cases and 60%+ on complex ones. [Inference] Below those numbers, the user experience pain outweighs the cost savings.
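
As a rough sketch of how labeled outcomes turn into a headline number, the snippet below computes the success rate per use case and flags any that fall short of a target. The function names, the sample data, and the mapping of "booking" to the simple tier and "support" to the complex tier are assumptions for illustration.

```python
from collections import defaultdict


def success_rate_by_use_case(outcomes):
    """outcomes: iterable of (use_case, succeeded) pairs from labeled calls."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for use_case, succeeded in outcomes:
        totals[use_case] += 1
        successes[use_case] += 1 if succeeded else 0
    return {uc: successes[uc] / totals[uc] for uc in totals}


def use_cases_below_target(rates, targets):
    """Return the use cases whose success rate falls short of its target."""
    return sorted(uc for uc, rate in rates.items() if rate < targets[uc])


# Hypothetical labeled sample, far smaller than a real weekly sample would be.
labeled = [
    ("booking", True), ("booking", True), ("booking", True),
    ("booking", True), ("booking", False),
    ("support", True), ("support", False),
]
# Assumed targets mirroring the 80% / 60% guidance.
targets = {"booking": 0.80, "support": 0.60}

rates = success_rate_by_use_case(labeled)
print(rates)
print(use_cases_below_target(rates, targets))
```

Keeping the rate per use case, rather than one blended number, stops a high-volume simple flow from masking a failing complex one.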

    Latency metrics that matter

    Three latency numbers worth tracking:

  • Time-to-first-audio (TTFA). From end of user turn to first agent audio. Captures the entire pipeline. Target under 1.5s for natural conversation.
  • Interruption-to-silence latency. From user interrupting to agent audio actually stopping. Target under 200ms.
  • End-to-end conversation duration. Long conversations that end in success may signal user-friendly behavior; long conversations ending in failure signal stuck loops.

    Measure p50, p95, and p99. The tail matters in voice: one slow turn ruins the whole call. Per Deepgram's latency guide, latency is dominated by what happens after audio capture, not the network alone.
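
A minimal sketch of the percentile summary, assuming TTFA has already been logged per turn in seconds. It uses the nearest-rank percentile definition, which is one of several common conventions. The function names and sample values are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttfa_summary(ttfa_seconds, target_s=1.5):
    """Summarize per-turn time-to-first-audio against a target (default 1.5 s)."""
    over = sum(1 for t in ttfa_seconds if t > target_s)
    return {
        "p50": percentile(ttfa_seconds, 50),
        "p95": percentile(ttfa_seconds, 95),
        "p99": percentile(ttfa_seconds, 99),
        "share_over_target": over / len(ttfa_seconds),
    }


# Hypothetical per-turn TTFA samples in seconds: nine fast turns, one slow one.
samples = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 3.0]
summary = ttfa_summary(samples)
print(summary)
```

In this sample the median is a comfortable 1.0 s while p95 and p99 both land on the single 3.0 s turn, which is the kind of outlier an average or a median hides.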

    Interruption metrics

    Two specific numbers to track:

  • False-cutoff rate. The agent decides the user has finished speaking when they haven't. Should be under 5%.
  • Interruption response latency. When the user interrupts the agent, how quickly the agent's audio stops (the interruption-to-silence latency above). Should be under 200ms.

    These are usually invisible to standard monitoring. You have to instrument them explicitly. The audit comes from sampling real conversations and tagging.
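
One way to turn tagged samples into these two numbers is sketched below. The `TaggedTurn` record and its field names are hypothetical; they stand in for whatever your tagging tool exports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedTurn:
    # Rater tag: the agent started speaking before the user had finished.
    false_cutoff: bool
    # Timestamps in milliseconds, set only when the user interrupted the agent.
    user_interrupt_ms: Optional[int] = None
    agent_silent_ms: Optional[int] = None


def false_cutoff_rate(turns):
    """Share of sampled user turns that the agent cut off early."""
    if not turns:
        raise ValueError("no tagged turns")
    return sum(1 for t in turns if t.false_cutoff) / len(turns)


def interruption_latencies_ms(turns):
    """Time from the user's interruption to the agent's audio stopping."""
    return [
        t.agent_silent_ms - t.user_interrupt_ms
        for t in turns
        if t.user_interrupt_ms is not None and t.agent_silent_ms is not None
    ]


# Hypothetical tagged sample from reviewed calls.
turns = [
    TaggedTurn(false_cutoff=False),
    TaggedTurn(false_cutoff=True),
    TaggedTurn(false_cutoff=False, user_interrupt_ms=12_000, agent_silent_ms=12_150),
    TaggedTurn(false_cutoff=False, user_interrupt_ms=40_500, agent_silent_ms=40_820),
]
print(false_cutoff_rate(turns))
print(interruption_latencies_ms(turns))
```

Here the second interruption took 320 ms to silence, over the 200 ms target, even though nothing in a standard request-latency dashboard would show it.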

    Customer-success eval

    For business-critical agents, customer success rates per cohort matter more than per-call metrics. Track:

  • Resolution rate. % of customers whose issue was resolved without human follow-up.
  • Repeat-call rate. % of customers who call back within X days for the same issue. (High = the agent didn't actually resolve.)
  • Escalation accuracy. When the agent escalates, was the escalation appropriate?
  • CSAT or NPS post-call. A one-question post-call survey is the gold standard, as long as you can run it without hurting the response rate.

    These are slow-moving metrics, measured over weeks rather than minutes, but they're the ones that determine whether the agent stays deployed.
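
Repeat-call rate is the easiest of these to compute from call logs. The sketch below assumes each call is logged as a (customer, issue, timestamp) record and treats the window as a parameter, since the right number of days depends on the use case. The function name and record shape are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeat_call_rate(calls, window_days):
    """Share of (customer, issue) pairs with two calls within window_days.

    calls: iterable of (customer_id, issue, timestamp) tuples.
    """
    window = timedelta(days=window_days)
    by_key = defaultdict(list)
    for customer_id, issue, timestamp in calls:
        by_key[(customer_id, issue)].append(timestamp)
    if not by_key:
        raise ValueError("no calls to evaluate")

    repeats = 0
    for times in by_key.values():
        times.sort()
        # A repeat is any pair of consecutive calls that falls inside the window.
        if any(later - earlier <= window for earlier, later in zip(times, times[1:])):
            repeats += 1
    return repeats / len(by_key)


# Hypothetical call log.
calls = [
    ("c1", "billing", datetime(2026, 1, 1)),
    ("c1", "billing", datetime(2026, 1, 3)),   # called back after 2 days
    ("c2", "billing", datetime(2026, 1, 1)),
    ("c3", "login", datetime(2026, 1, 1)),
    ("c3", "login", datetime(2026, 1, 20)),    # called back after 19 days
    ("c4", "login", datetime(2026, 1, 2)),
]
print(repeat_call_rate(calls, window_days=7))
```

Matching on the issue as well as the customer matters: a customer who calls back about something unrelated is not evidence that the first call failed.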

    Regression testing for voice

    When you change a prompt, swap a model, or upgrade a TTS, what breaks? Voice agents need regression tests, but the format is unusual:

  • Scenario replays. A fixed set of "conversations" with synthetic user turns. Run the new agent through each and check outcomes.
  • Audio replays. Real recorded user audio replayed at the agent. Tests the full pipeline including STT.
  • LLM-judged grading. A grader LLM rates the new agent's responses against a rubric. Approximate but scalable.

    A practical setup is 100 scenarios that exercise the main intents plus 30 known edge cases (interruptions, code-switching, repair, escalation triggers). Run them on every change. Compare success rates and latency to the previous version.
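
The comparison step can be sketched as a diff between two runs of the same suite. The function name, result format (scenario id mapped to pass or fail), and scenario ids are assumptions for illustration; running the agent and grading each scenario are out of scope here.

```python
def compare_runs(baseline, candidate):
    """Compare two runs of the same scenario suite.

    baseline, candidate: dicts mapping scenario id -> passed (bool).
    Only scenarios present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no scenarios")
    return {
        "baseline_rate": sum(1 for s in shared if baseline[s]) / len(shared),
        "candidate_rate": sum(1 for s in shared if candidate[s]) / len(shared),
        # Passed before the change, fails after it.
        "regressed": [s for s in shared if baseline[s] and not candidate[s]],
        # Failed before the change, passes after it.
        "fixed": [s for s in shared if not baseline[s] and candidate[s]],
    }


# Hypothetical results before and after a prompt change.
baseline = {"book_basic": True, "book_interrupt": True, "escalate": False, "code_switch": True}
candidate = {"book_basic": True, "book_interrupt": False, "escalate": True, "code_switch": True}

report = compare_runs(baseline, candidate)
print(report)
```

Both runs score 75% here, yet one scenario regressed and another was fixed. Reporting the per-scenario diff alongside the aggregate rate keeps that kind of trade from slipping through.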

    What synthetic eval misses

    Synthetic evaluation has limits:

  • Real users do unexpected things. Tangents, profanity, multitasking, lying about identity. Synthetic users don't.
  • Real audio has noise. Background TV, kids, road noise, terrible Bluetooth. Synthetic audio is clean.
  • Real timing is messier. Long pauses, rapid-fire interruptions, half-words.

    Treat synthetic eval as a regression-prevention tool. Treat real-call sampling as the ground truth. The teams that ship best run both.

    Who does the labeling

    Three options for labeling conversations:

  • Pure human raters. Most accurate, slow, expensive. Good for ground-truth datasets and rubric calibration.
  • LLM-as-judge. Cheap, fast, biased toward verbose or polite answers. [Inference] Useful for triage and trend-detection.
  • Hybrid. LLM-as-judge filters; humans rate the disagreement set. Best ROI for most teams.

    Calibrate the LLM judge against human judgment quarterly. Drift is real.
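
A calibration check can be as simple as comparing the judge's pass/fail labels with human labels on the same conversations. The sketch below reports raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. Kappa is one common choice rather than something this guide prescribes, and the function name and sample labels are illustrative.

```python
def agreement_and_kappa(judge, human):
    """Return (raw agreement, Cohen's kappa) for two lists of pass/fail labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    observed = sum(1 for j, h in zip(judge, human) if j == h) / n

    # Chance agreement from each rater's own pass rate.
    judge_pass = sum(1 for j in judge if j) / n
    human_pass = sum(1 for h in human if h) / n
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)

    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined,
        # so report full agreement instead of dividing by zero.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels for the same eight conversations (True = success).
judge_labels = [True, True, True, False, False, True, False, True]
human_labels = [True, True, False, False, False, True, True, True]

observed, kappa = agreement_and_kappa(judge_labels, human_labels)
print(round(observed, 3), round(kappa, 3))
```

Tracking the same statistic on a fixed human-rated set each quarter gives drift a number: if agreement drops after a judge model or rubric change, the judge needs recalibrating before its trends can be trusted.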

    Vendors and platforms

    A few platforms in 2026 specialize in voice agent evaluation: Coval, Hamming, and others have built around scenario replay, LLM-judged grading, and real-call sampling. Coval's 2026 conversation with Pipecat and Ultravox covers the state of voice eval at the practitioner level.

    For early-stage teams, a homegrown setup with scripted scenarios and an LLM grader is enough. Adopt a platform when scenario count crosses a few hundred or when human raters are the bottleneck.

    A pragmatic eval stack

    For a 2026 voice agent in production:

  • Conversation success rate as the headline metric, defined per use case.
  • TTFA p50/p95/p99 as the latency headline.
  • Interruption metrics (false cutoff, response latency) sampled weekly.
  • Customer-success metrics (resolution, repeat-call, CSAT) tracked monthly.
  • Regression scenario suite of ~100 scenarios run on every change.
  • Real-call sampling of 50+ conversations per week, hand-rated.

    Total time investment: ~10 hours/week of human attention. Total impact: the difference between "deployed and improving" and "deployed and stagnant."

    The bottom line

    WER tells you about one stage of one component. Conversation success rate, TTFA, interruption metrics, and customer-success outcomes tell you about the agent. The eval stack should match the actual product surface, not the slot the academic literature happens to provide. Build the metrics first; the agent improves itself once the metrics are honest.

    Frequently Asked Questions

    Should I use WER at all?
    Yes, but as a component-level diagnostic, not a product-level metric. If WER is bad in a specific language or noise condition, that's a signal to dig into STT. But low WER does not mean the agent is good; high WER does not always mean it's bad.

    How big should my regression scenario suite be?
    Start with 50 scenarios covering main intents. Grow to 100–200 as you find production failure modes. Beyond ~500 scenarios, the runtime and maintenance cost outweighs marginal coverage gain. Quality of scenarios matters more than count.

    Can I use one LLM to grade another LLM's voice responses?
    Yes, with caveats. LLM-as-judge is biased toward verbose, polite responses and tends to over-grade fluent text. Calibrate it quarterly against human raters. For high-stakes evaluation, use the LLM as triage and humans as ground truth.
