VAD and Endpointing: Why Your Voice Agent Feels Slow

CallMissed

If your voice agent feels sluggish, the culprit is almost never the LLM. It is endpointing — the silence-detection logic that decides "the user is done speaking, start processing." Most teams over-engineer their LLM stack and under-engineer their VAD and endpointing, then wonder why their pipeline feels half a second slower than it should.

What VAD and endpointing actually do

Voice activity detection (VAD) classifies short audio chunks as speech or non-speech. Silero VAD, the most widely deployed open-source VAD in 2026, processes each 30+ ms chunk in under 1 ms on a single CPU thread, per the Silero VAD repo. Its real-time factor on CPU is around 0.004, meaning ~15 seconds to process an hour of audio.

Endpointing is the higher-level decision: given a sequence of speech and silence, has the user finished a turn? Endpointing is built on top of VAD but adds time-window logic and often a semantic component.

VAD answers "is there voice in this 30ms chunk?" Endpointing answers "has the user finished this turn, so I can hand it to the rest of the pipeline, or should I wait for more?" They are different problems and need to be tuned separately.
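To make the split concrete, here is a minimal sketch of an endpointer that consumes per-chunk VAD decisions and applies a trailing-silence window. The `Endpointer` class, `first_endpoint` helper, and the 400ms default are illustrative assumptions, not the API of any particular library; the VAD itself is represented only by a stream of booleans.

```python
from dataclasses import dataclass

CHUNK_MS = 30  # chunk duration, taken from the 30ms figure in the text


@dataclass
class Endpointer:
    """Hypothetical endpointer: turns per-chunk VAD flags into a turn decision."""

    window_ms: int = 400  # trailing silence required to end a turn (assumed default)
    speech_ms: int = 0    # speech accumulated in the current turn
    silence_ms: int = 0   # silence since the last speech chunk

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD result; return True when the turn has ended."""
        if is_speech:
            self.speech_ms += CHUNK_MS
            self.silence_ms = 0
            return False
        if self.speech_ms == 0:
            return False  # silence before any speech: there is no turn to end
        self.silence_ms += CHUNK_MS
        if self.silence_ms >= self.window_ms:
            self.speech_ms = 0
            self.silence_ms = 0
            return True
        return False


def first_endpoint(flags, window_ms=400):
    """Return the index of the chunk at which the first turn ends, or None."""
    endpointer = Endpointer(window_ms=window_ms)
    for index, flag in enumerate(flags):
        if endpointer.push(flag):
            return index
    return None
```

The VAD only ever produces the booleans; every latency decision lives in `window_ms`, which is why the two are tuned separately.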

The endpointing latency budget

A voice agent's end-to-end response latency typically breaks down something like:

  • Network in (RTP/Opus): ~50ms
  • VAD + endpointing wait: 200–800ms
  • STT finalization: 100–400ms
  • LLM time-to-first-token: 200–600ms
  • TTS time-to-first-audio: 75–300ms
  • Network out: ~50ms

Endpointing is often the single largest line item. The default endpointing window in many libraries is 700–1000ms of trailing silence, and that silence is dead time before any other component even starts. If you cut endpointing from 800ms to 300ms, you shave half a second off perceived latency without touching the LLM.
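The arithmetic behind the budget can be checked with a few lines of Python. The ranges are copied from the list above; the stage names and helper functions are made up for illustration.

```python
# (low_ms, high_ms) per stage, copied from the budget list in the text.
BUDGET_MS = {
    "network_in": (50, 50),
    "vad_endpointing": (200, 800),
    "stt_finalization": (100, 400),
    "llm_ttft": (200, 600),
    "tts_first_audio": (75, 300),
    "network_out": (50, 50),
}


def total(budget, bound):
    """Sum the low or high end of every stage."""
    index = 0 if bound == "low" else 1
    return sum(stage[index] for stage in budget.values())


def share(budget, stage, bound):
    """Fraction of the total budget spent in one stage."""
    index = 0 if bound == "low" else 1
    return budget[stage][index] / total(budget, bound)


def with_endpointing(budget, window_ms):
    """Copy of the budget with the endpointing wait pinned to one value."""
    updated = dict(budget)
    updated["vad_endpointing"] = (window_ms, window_ms)
    return updated


if __name__ == "__main__":
    print(total(BUDGET_MS, "low"), total(BUDGET_MS, "high"))
    print(round(share(BUDGET_MS, "vad_endpointing", "high"), 2))
    slow = total(with_endpointing(BUDGET_MS, 800), "high")
    fast = total(with_endpointing(BUDGET_MS, 300), "high")
    print(slow - fast)
```

With these figures the end-to-end range is 675ms to 2200ms, endpointing is about 36% of the worst case, and moving the window from 800ms to 300ms removes exactly 500ms.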

Why endpointing windows are so generous by default

Aggressive endpointing causes false cutoffs: the agent decides the user is done and starts processing while the user was just pausing for breath. The user hears the agent talk over them mid-sentence.

The trade-off is real:

  • Long window (700–1000ms): Few false cutoffs, slow feel.
  • Short window (200–400ms): Faster feel, but more false cutoffs, especially for slower speakers, non-native speakers, and hesitation-prone callers.

In 2026 the best-of-both pattern is adaptive endpointing: start with a short window and lengthen it on signals of uncertainty, such as recent disfluencies, mid-utterance prosody, or partially complete sentences. LiveKit Agents 1.5 introduced semantic turn detection that does some of this. [Inference]
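A minimal sketch of the adaptive idea, assuming the three uncertainty signals have already been computed upstream. The `TurnSignals` fields, the 300ms base, the 250ms step, and the 1000ms cap are all hypothetical values chosen for illustration, not how LiveKit or any other framework implements it.

```python
from dataclasses import dataclass

BASE_WINDOW_MS = 300   # assumed short starting window
STEP_MS = 250          # assumed extension per uncertainty signal
MAX_WINDOW_MS = 1000   # never wait longer than the generous defaults


@dataclass
class TurnSignals:
    """Hypothetical per-turn signals; producing them is out of scope here."""

    recent_disfluency: bool     # e.g. "um" or "uh" in the latest partial transcript
    mid_utterance_pitch: bool   # prosody suggests the speaker is not finished
    sentence_incomplete: bool   # partial transcript ends mid-clause


def adaptive_window_ms(signals: TurnSignals) -> int:
    """Lengthen the trailing-silence window once per active uncertainty signal."""
    active = sum(
        [
            signals.recent_disfluency,
            signals.mid_utterance_pitch,
            signals.sentence_incomplete,
        ]
    )
    return min(BASE_WINDOW_MS + active * STEP_MS, MAX_WINDOW_MS)
```

A confident, complete-sounding turn keeps the 300ms window, while a hesitant one drifts back toward the conservative default.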

False cutoffs vs false retentions

The two failure modes:

  • False cutoff. The agent thinks the user is done when they aren't, talks over them, breaks the conversation.
  • False retention. The agent waits longer than needed, the user wonders if they were heard, sometimes restarts.

False cutoffs are user-hostile and obvious. False retentions are user-hostile and subtle. The latter is the "is it broken or just slow" frustration that drives users away without complaints.

A well-tuned system minimizes both, but they're zero-sum on the endpointing window alone. The escape hatch is to add semantic signal: actually look at what the user has said so far and decide if it's a complete thought.
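The zero-sum relationship shows up as soon as you sweep the window over labeled pauses. The sketch below uses made-up pause data (each inner list is the mid-turn pauses, in ms, of one user turn); the function names are illustrative.

```python
def false_cutoff_rate(turns, window_ms):
    """Fraction of turns containing a mid-turn pause at least as long as the window."""
    cut = sum(1 for pauses in turns if any(p >= window_ms for p in pauses))
    return cut / len(turns)


# Fabricated sample for illustration only: mid-turn pauses per turn, in ms.
SAMPLE_TURNS = [
    [120, 250],
    [80],
    [450, 150],
    [],
    [300, 720],
    [200],
    [350],
    [90, 610],
]


def sweep(turns, windows_ms):
    """Pair each candidate window with the false-cutoff rate it would produce."""
    return [(w, false_cutoff_rate(turns, w)) for w in windows_ms]


if __name__ == "__main__":
    for window_ms, rate in sweep(SAMPLE_TURNS, [200, 400, 800]):
        # Every correctly ended turn still waits window_ms before processing starts.
        print(f"window={window_ms}ms false_cutoffs={rate:.1%} added_wait={window_ms}ms")
```

On this toy data a 200ms window cuts off 75% of turns, 400ms cuts off 37.5%, and 800ms cuts off none, but each step adds the same amount of dead time to every turn. The window alone cannot improve one number without worsening the other.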

Silero VAD: the workhorse

Silero VAD has been the de facto open-source VAD for several years for a reason: it's fast, accurate, and runs on CPU. Its key properties in 2026:

  • 30ms chunk processing under 1ms on CPU.
  • Pre-trained on multilingual data.
  • Single binary, no GPU dependency.
  • Used inside Pipecat, LiveKit Agents, and most managed voice platforms.

Picovoice's Cobra is a commercial alternative pitched at edge use cases. WebRTC VAD remains in some legacy pipelines but is widely considered worse than Silero on noisy audio. Picovoice's VAD comparison covers the trade-offs.

Tuning recipe for production

A practical starting point for tuning VAD + endpointing on a new voice agent:

  • Use Silero VAD with default thresholds. Don't reinvent it.
  • Set endpointing window to 400ms as a baseline. Lengthen on metrics, not on hunches.
  • Add a minimum-utterance-length filter. Don't endpoint until the user has spoken at least 200ms — this filters out coughs, breath sounds, and mic clicks.
  • Layer in a semantic check if your stack supports it. A short classifier or LLM call asking "is this a complete user turn" can reclaim 100–200ms without false cutoffs.
  • Measure both error modes. Tag a sample of real conversations with "false cutoff," "false retention," "correct." Tune until both error rates are below 5%.
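
The decision logic in the recipe above can be sketched in one function. The 400ms window and 200ms minimum utterance come from the recipe; the 250ms early-exit window and the `looks_complete` word-list heuristic are assumptions standing in for a real classifier or LLM call.

```python
from typing import Callable, Optional

WINDOW_MS = 400          # baseline trailing-silence window from the recipe
MIN_UTTERANCE_MS = 200   # minimum speech before any endpoint is allowed
EARLY_WINDOW_MS = 250    # assumed: earliest point at which the semantic check may fire


def should_endpoint(
    speech_ms: int,
    silence_ms: int,
    transcript: str,
    is_complete: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Decide whether the current turn should be handed to the rest of the pipeline."""
    if speech_ms < MIN_UTTERANCE_MS:
        return False  # cough, breath, or mic click: too short to be a turn
    if silence_ms >= WINDOW_MS:
        return True   # plain trailing-silence rule
    if is_complete is not None and silence_ms >= EARLY_WINDOW_MS:
        return is_complete(transcript)  # semantic check can end the turn early
    return False


def looks_complete(transcript: str) -> bool:
    """Toy stand-in for a semantic check: a turn ending on a filler is not done."""
    words = transcript.strip().lower().split()
    if not words:
        return False
    trailing = {"and", "but", "so", "um", "uh", "because", "the", "to"}
    return words[-1].strip(".,?!") not in trailing
```

With these numbers a complete-sounding turn ends 150ms sooner than the baseline, while an utterance that trails off on "to" or "um" still waits out the full 400ms.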

What goes wrong in noisy environments

VAD performance degrades in:

  • Reverberant rooms. Echo confuses energy-based detectors.
  • Low-SNR phone audio. Cellular or VoIP audio with background traffic noise.
  • Multi-speaker environments. Two people in the room, one is the actual user.

The mitigations are noise suppression upstream of VAD (RNNoise, Krisp), better hardware (good mics matter), and acoustic echo cancellation in WebRTC pipelines. Don't expect any VAD to compensate for a fundamentally broken audio path. [Inference]

The naturalness ceiling

Even with surgical tuning, fully natural-feeling endpointing requires a perfect predictor of human turn-taking, which is itself an open research problem. The best 2026 systems reach roughly the level of "competent non-native speaker" rather than "native interlocutor." [Speculation] Don't promise users magic.

The bottom line

Tune endpointing first, tune the LLM second. A 500ms reduction in endpointing latency often improves perceived agent speed more than a model upgrade does. Use Silero VAD, start with a 400ms window, layer in semantic checks, measure both error modes, and revisit as your traffic profile evolves.

Frequently Asked Questions

How do I measure endpointing latency in production?
Log the timestamp of last-detected-speech and the timestamp of endpoint-decision for each turn. The delta is your endpointing latency. Roll up p50 and p95 over a sample of real calls, because averages hide tail latency.
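
A minimal sketch of that roll-up, assuming each logged turn is a dict with two hypothetical millisecond fields, `last_speech_ms` and `endpoint_decision_ms`. The percentile uses the nearest-rank method; your metrics stack may interpolate differently.

```python
import math


def endpointing_latencies_ms(turns):
    """Per-turn delta between last detected speech and the endpoint decision."""
    return [t["endpoint_decision_ms"] - t["last_speech_ms"] for t in turns]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(turns):
    """Return mean, p50, and p95 endpointing latency for a batch of turns."""
    latencies = endpointing_latencies_ms(turns)
    return {
        "mean": sum(latencies) / len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
    }
```

One slow turn in ten barely moves the mean but dominates p95, which is the number users actually feel.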

Is server-side VAD or client-side VAD better?
Server-side VAD is the norm in 2026. Client-side VAD reduces upstream bandwidth but spreads tuning across heterogeneous devices and complicates A/B testing of VAD parameters. Server-side gives you one place to tune and consistent behavior across clients.

Can the LLM do its own endpointing?
Sort of. Semantic turn detection is exactly that: a small classifier or LLM call decides whether what the user has said so far represents a complete turn. It's a useful layer on top of an acoustic VAD such as Silero, but not a replacement.
