VAD and Endpointing: Why Your Voice Agent Feels Slow

CallMissed · May 8, 2026 · 5 min read · Guide

If your voice agent feels sluggish, the culprit is almost never the LLM. It is endpointing — the silence-detection logic that decides "the user is done speaking, start processing." Most teams over-engineer their LLM stack and under-engineer their VAD and endpointing, then wonder why their pipeline feels half a second slower than it should.

What VAD and endpointing actually do

Voice activity detection (VAD) classifies short audio chunks as speech or non-speech. Silero VAD, the most widely deployed open-source VAD in 2026, processes each 30+ ms chunk in under 1 ms on a single CPU thread, per the Silero VAD repo. Its real-time factor on CPU is around 0.004, which works out to roughly 15 seconds of compute per hour of audio (0.004 × 3,600 s ≈ 14 s).

Endpointing is the higher-level decision: given a sequence of speech and silence, has the user finished a turn? Endpointing is built on top of VAD but adds time-window logic and often a semantic component.

VAD answers "is there voice in this 30ms chunk?" Endpointing answers "has the turn ended, so I can hand what the user said to the rest of the pipeline, or should I wait for more?" They are different problems and need to be tuned separately.
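The split between the two layers can be sketched in a few lines. The following is a minimal illustration, not the logic of any particular library: it assumes some VAD (Silero or otherwise) already produces one speech probability per 30ms frame, and the `Endpointer` class, its field names, and its default values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Turns per-frame VAD output into an end-of-turn decision (illustrative)."""

    frame_ms: int = 30        # duration of one VAD frame
    window_ms: int = 400      # trailing silence needed to end a turn (assumed)
    min_speech_ms: int = 200  # speech needed before a turn counts (assumed)
    threshold: float = 0.5    # speech-probability cutoff (assumed)
    speech_ms: int = 0        # speech accumulated in the current turn
    silence_ms: int = 0       # silence accumulated since the last speech frame

    def push(self, speech_prob: float) -> bool:
        """Feed one frame's speech probability; return True once the turn has ended."""
        if speech_prob >= self.threshold:
            self.speech_ms += self.frame_ms
            self.silence_ms = 0
            return False
        if self.speech_ms == 0:
            return False  # leading silence: there is no turn to end yet
        self.silence_ms += self.frame_ms
        if self.silence_ms < self.window_ms:
            return False  # the pause is still shorter than the window
        if self.speech_ms < self.min_speech_ms:
            # Too little speech to be a turn (cough, click): discard and keep listening.
            self.speech_ms = 0
            self.silence_ms = 0
            return False
        return True


def first_endpoint(probs, endpointer):
    """Index of the first frame at which the endpointer fires, or None."""
    for i, prob in enumerate(probs):
        if endpointer.push(prob):
            return i
    return None


# 300ms of speech followed by silence: fires once 400ms of silence has elapsed.
print(first_endpoint([0.9] * 10 + [0.1] * 20, Endpointer()))
# A 60ms blip followed by silence never produces a turn.
print(first_endpoint([0.9] * 2 + [0.1] * 20, Endpointer()))
```

The VAD threshold and the endpointing window are separate fields here, so each can be tuned without touching the other.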

The endpointing latency budget

A voice agent's end-to-end response latency typically breaks down something like:

Network in (RTP/Opus): ~50ms

VAD + endpointing wait: 200–800ms

STT finalization: 100–400ms

LLM time-to-first-token: 200–600ms

TTS time-to-first-audio: 75–300ms

Network out: ~50ms

Endpointing is often the single largest line item. The default endpointing window in many libraries is 700–1000ms of trailing silence — and that silence is dead time before any other component even starts. If you cut endpointing from 800ms to 300ms, you shave half a second off perceived latency without touching the LLM.
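To make the arithmetic concrete, here is a small sketch that sums the ranges from the breakdown above. The stage names and the `total_ms` and `share` helpers are made up for illustration; only the millisecond figures come from the list.

```python
# (low_ms, high_ms) per stage, copied from the breakdown above.
BUDGET_MS = {
    "network_in": (50, 50),
    "vad_endpointing": (200, 800),
    "stt_finalization": (100, 400),
    "llm_ttft": (200, 600),
    "tts_ttfa": (75, 300),
    "network_out": (50, 50),
}


def total_ms(budget):
    """Best-case and worst-case end-to-end latency in milliseconds."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def share(budget, stage):
    """Fraction of the worst-case total spent in one stage."""
    _, high = total_ms(budget)
    return budget[stage][1] / high


print(total_ms(BUDGET_MS))                          # (675, 2200)
print(round(share(BUDGET_MS, "vad_endpointing"), 2))  # 0.36

# Same pipeline with the endpointing window cut from 800ms to 300ms.
tuned = dict(BUDGET_MS, vad_endpointing=(200, 300))
print(total_ms(BUDGET_MS)[1] - total_ms(tuned)[1])  # 500
```

At the high end of each range, endpointing accounts for more than a third of the total, which is why it is worth tuning before anything else.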

Why endpointing windows are so generous by default

Aggressive endpointing causes false cutoffs — the agent decides the user is done and starts processing while the user was just pausing for breath. The user hears the agent talk over them mid-sentence.

The trade-off is real:

Long window (700–1000ms): Few false cutoffs, slow feel.

Short window (200–400ms): Faster feel, more false cutoffs, especially for slower speakers, non-native speakers, and hesitation-prone callers.

In 2026 the best-of-both pattern is adaptive endpointing: start with a short window and lengthen it on signals of uncertainty — recent disfluencies, mid-utterance prosody, partially complete sentences. LiveKit Agents 1.5 introduced semantic turn detection that does some of this. [Inference]
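One way to picture adaptive endpointing is a function that picks the silence window from the partial transcript. This is a rough sketch and not how LiveKit or any other framework implements it: the word lists, the step sizes, and the `adaptive_window_ms` name are all assumptions, and it presumes the STT engine emits punctuation on partial results, which many do not.

```python
# Hypothetical cue lists; a real system would learn these signals or use a model.
FILLERS = {"um", "uh", "er", "hmm"}
CONTINUERS = {"and", "but", "so", "or", "because", "then", "to", "the"}


def adaptive_window_ms(partial_transcript, base_ms=300, step_ms=250, max_ms=1000):
    """Start from a short window and lengthen it once per uncertainty cue."""
    text = partial_transcript.strip()
    words = text.lower().split()
    if not words:
        return max_ms  # nothing transcribed yet: be patient
    last_word = words[-1].strip(".,!?")
    window = base_ms
    if last_word in FILLERS:
        window += step_ms  # trailing disfluency
    if last_word in CONTINUERS:
        window += step_ms  # the sentence is visibly unfinished
    if not text.endswith((".", "!", "?")):
        window += step_ms  # no terminal punctuation from the STT
    return min(window, max_ms)


print(adaptive_window_ms("What time do you open?"))      # 300
print(adaptive_window_ms("I want to book a table and"))  # 800
```

A complete-looking question keeps the short window, while a sentence that trails off on a conjunction earns the caller more time before the agent responds.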

False cutoffs vs false retentions

The two failure modes:

False cutoff. The agent thinks the user is done when they aren't, talks over them, breaks the conversation.

False retention. The agent waits longer than needed, the user wonders if they were heard, sometimes restarts.

False cutoffs are user-hostile and obvious. False retentions are user-hostile and subtle. The latter is the "is it broken or just slow" frustration that drives users away without complaints.

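A well-tuned system minimizes both, but with the endpointing window as the only knob they trade off directly: shortening the window cuts false retentions and adds false cutoffs, and lengthening it does the reverse. The escape hatch is to add semantic signal, meaning you actually look at what the user has said so far and decide if it's a complete thought.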

Silero VAD: the workhorse

Silero VAD has been the de facto open-source VAD for several years for a reason: it's fast, accurate, and runs on CPU. Its key properties in 2026:

30ms chunk processing under 1ms on CPU.

Pre-trained on multilingual data.

Small model file (TorchScript or ONNX), no GPU dependency.

Used inside Pipecat, LiveKit Agents, and most managed voice platforms.

Picovoice's Cobra is a commercial alternative pitched at edge use cases. WebRTC VAD remains in some legacy pipelines but is widely considered worse than Silero on noisy audio. Picovoice's VAD comparison covers the trade-offs.

Tuning recipe for production

A practical starting point for tuning VAD + endpointing on a new voice agent:

Use Silero VAD with default thresholds. Don't reinvent it.

Set endpointing window to 400ms as a baseline. Lengthen on metrics, not on hunches.

Add a minimum-utterance-length filter. Don't endpoint until the user has spoken at least 200ms — this filters out coughs, breath sounds, and mic clicks.

Layer in a semantic check if your stack supports it. A short classifier or LLM call asking "is this a complete user turn" can reclaim 100–200ms without false cutoffs.

Measure both error modes. Tag a sample of real conversations with "false cutoff," "false retention," "correct." Tune until both error rates are below 5%.
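The last step of the recipe can be sketched as a simple tally over hand-labelled turns. The label strings, the function names, and the sample counts below are invented for illustration; only the 5% target comes from the recipe.

```python
from collections import Counter

# Hypothetical label vocabulary for hand-tagged turns.
LABELS = ("correct", "false_cutoff", "false_retention")


def error_rates(labels):
    """Share of turns tagged with each error mode."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one labelled turn")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {name: counts[name] / len(labels) for name in LABELS[1:]}


def within_target(rates, target=0.05):
    """True only when every error rate is strictly below the target."""
    return all(rate < target for rate in rates.values())


# Made-up sample of 100 tagged turns.
sample = ["correct"] * 90 + ["false_cutoff"] * 3 + ["false_retention"] * 7
rates = error_rates(sample)
print(rates)                 # {'false_cutoff': 0.03, 'false_retention': 0.07}
print(within_target(rates))  # False
```

In this made-up sample the false retention rate is over target while false cutoffs are under it, which suggests there is room to shorten the window.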

What goes wrong in noisy environments

VAD performance degrades in:

Reverberant rooms. Echo confuses energy-based detectors.

Low-SNR phone audio. Cell or VOIP audio with background traffic noise.

Multi-speaker environments. Two people in the room, one is the actual user.

The mitigations are noise suppression upstream of VAD (RNNoise, Krisp), better hardware (good mics matter), and acoustic echo cancellation in WebRTC pipelines. Don't expect any VAD to compensate for a fundamentally broken audio path. [Inference]

The naturalness ceiling

Even with surgical tuning, fully natural-feeling endpointing requires a perfect predictor of human turn-taking — which is itself an open research problem. The best 2026 systems reach roughly the level of "competent non-native speaker" rather than "native interlocutor." [Speculation] Don't promise users magic.

The bottom line

Tune endpointing first, tune the LLM second. A 500ms reduction in endpointing latency usually improves perceived agent speed more than a model upgrade does. Use Silero VAD, start with a 400ms window, layer in semantic checks, measure both error modes, and revisit as your traffic profile evolves.

Frequently Asked Questions

How do I measure endpointing latency in production?

Log the timestamp of last-detected-speech and the timestamp of endpoint-decision for each turn. The delta is your endpointing latency. Roll up p50 and p95 over a sample of real calls — averages hide tail latency.
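As a sketch of that roll-up, assuming each turn is logged as a record with `last_speech_ms` and `endpoint_ms` fields (hypothetical names) and using a nearest-rank percentile:

```python
import math


def endpointing_latencies_ms(turns):
    """Per-turn delta between last detected speech and the endpoint decision."""
    return [turn["endpoint_ms"] - turn["last_speech_ms"] for turn in turns]


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up log: nine ordinary turns and one slow outlier.
deltas = [300, 320, 310, 330, 305, 315, 325, 335, 340, 1200]
turns = [{"last_speech_ms": 10_000, "endpoint_ms": 10_000 + d} for d in deltas]

latencies = endpointing_latencies_ms(turns)
print(percentile(latencies, 50))        # 320
print(percentile(latencies, 95))        # 1200
print(sum(latencies) / len(latencies))  # 408.0
```

The mean of 408ms describes neither the typical turn nor the slow one, which is the reason to report p50 and p95 instead.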

Is server-side VAD or client-side VAD better?

Server-side VAD is the norm in 2026. Client-side VAD reduces upstream bandwidth but spreads tuning across heterogeneous devices and complicates A/B testing of VAD parameters. Server-side gives you one place to tune and consistent behavior across clients.

Can the LLM do its own endpointing?

Sort of: semantic turn detection is exactly that. A small classifier or LLM call decides whether what the user has said so far represents a complete turn. It's a useful layer on top of acoustic VAD such as Silero, but not a replacement.