VAD and Endpointing: Why Your Voice Agent Feels Slow
If your voice agent feels sluggish, the culprit is almost never the LLM. It is endpointing — the silence-detection logic that decides "the user is done speaking, start processing." Most teams over-engineer their LLM stack and under-engineer their VAD and endpointing, then wonder why their pipeline feels half a second slower than it should.
What VAD and endpointing actually do
Voice activity detection (VAD) classifies short audio chunks as speech or non-speech. Silero VAD, one of the most widely deployed open-source VADs in 2026, processes each audio chunk of 30 ms or more in under 1 ms on a single CPU thread, per the Silero VAD repo. Its real-time factor on CPU is around 0.004, meaning roughly 14 seconds (0.004 × 3,600 s) to process an hour of audio.
Endpointing is the higher-level decision: given a sequence of speech and silence, has the user finished a turn? Endpointing is built on top of VAD but adds time-window logic and often a semantic component.
VAD answers "is there voice in this 30 ms chunk?" Endpointing answers "should I hand what the user has said so far to the LLM now, or wait for more?" They are different problems and need to be tuned separately.
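To make the split concrete, here is a minimal sketch of a silence-window endpointer layered on top of per-frame VAD output. It assumes the VAD has already reduced each frame to a boolean; the class name, its fields, and the 30 ms frame size are illustrative choices, not the API of Silero or any other library.

```python
from dataclasses import dataclass


@dataclass
class SilenceEndpointer:
    """Turns per-frame VAD decisions into an end-of-turn decision."""

    frame_ms: int = 30     # duration of one VAD frame (assumed)
    window_ms: int = 800   # trailing silence required to end a turn
    heard_speech: bool = False
    silence_ms: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the turn has ended."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0  # any speech resets the silence timer
            return False
        if not self.heard_speech:
            return False  # leading silence: the user has not started yet
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.window_ms:
            # Turn is over; reset so the next turn starts clean.
            self.heard_speech = False
            self.silence_ms = 0
            return True
        return False
```

The VAD only ever supplies `is_speech`; everything about when to stop listening lives in `window_ms` and the reset logic, which is why the two can be tuned independently.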
The endpointing latency budget
A voice agent's end-to-end response latency typically breaks down something like:
Endpointing is often the single largest line item. The default endpointing window in many libraries is 700–1000ms of trailing silence — and that silence is dead time before any other component even starts. If you cut endpointing from 800ms to 300ms, you shave half a second off perceived latency without touching the LLM.
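The arithmetic can be sketched in a few lines. This assumes, as the paragraph above does, a pipeline where nothing downstream starts until the endpoint fires. The function name is made up for illustration, and `DOWNSTREAM_MS` is a placeholder rather than a measured figure; the saving does not depend on its value.

```python
def time_to_first_audio_ms(endpoint_window_ms: int, downstream_ms: int) -> int:
    """Delay from the user's last word to the agent's first audio.

    The trailing-silence window is pure waiting, so it adds directly
    to whatever the rest of the pipeline takes.
    """
    return endpoint_window_ms + downstream_ms


# Placeholder for everything after endpointing; not a measured value.
DOWNSTREAM_MS = 600

before = time_to_first_audio_ms(800, DOWNSTREAM_MS)
after = time_to_first_audio_ms(300, DOWNSTREAM_MS)
saved_ms = before - after  # the 800 ms -> 300 ms window change
print(f"before={before} ms, after={after} ms, saved={saved_ms} ms")
```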
Why endpointing windows are so generous by default
Aggressive endpointing causes false cutoffs — the agent decides the user is done and starts processing while the user was just pausing for breath. The user hears the agent talk over them mid-sentence.
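The trade-off is real: a shorter window lowers latency but cuts off more users mid-thought, while a longer window protects slow speakers at the cost of dead air on every turn.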
In 2026 the best-of-both pattern is adaptive endpointing: start with a short window and lengthen it on signals of uncertainty — recent disfluencies, mid-utterance prosody, partially complete sentences. LiveKit Agents 1.5 introduced semantic turn detection that does some of this. [Inference]
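A rough sketch of the adaptive idea, using only the partial transcript as the uncertainty signal. This is a toy heuristic for illustration, not how LiveKit's turn detector works: the word lists, the 400 ms increments, and the function name are all assumptions, and prosody is left out because it needs audio features.

```python
import re

# Toy word lists; a real system would use a trained model instead.
FILLERS = {"um", "uh", "er", "hmm"}
DANGLING = {"and", "but", "or", "so", "because", "to", "the", "a", "of", "with", "if"}


def adaptive_window_ms(partial_transcript: str,
                       base_ms: int = 300,
                       max_ms: int = 1000) -> int:
    """Pick a trailing-silence window from the transcript so far."""
    words = re.findall(r"[a-z']+", partial_transcript.lower())
    if not words:
        return max_ms  # nothing recognized yet: be patient
    window = base_ms
    last = words[-1]
    if last in FILLERS:
        window += 400  # recent disfluency: the user is probably still thinking
    if last in DANGLING:
        window += 400  # sentence looks unfinished
    return min(window, max_ms)
```

A transcript ending in a content word keeps the short base window, while one ending in "to" or "um" buys the user extra time before the agent responds.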
False cutoffs vs false retentions
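There are two failure modes. A false cutoff is the agent deciding the turn is over while the user is still mid-thought. A false retention is the agent continuing to wait after the user has actually finished.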
False cutoffs are user-hostile and obvious. False retentions are user-hostile and subtle. The latter is the "is it broken or just slow" frustration that drives users away without complaints.
A well-tuned system minimizes both, but they're zero-sum on the endpointing window alone. The escape hatch is to add semantic signal — actually look at what the user has said so far and decide if it's a complete thought.
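Minimizing both error modes starts with measuring both. The sketch below assumes you have hand-labeled turns with the true end of speech and the time the endpointer fired. It uses working definitions that are mine rather than a standard: a false cutoff is an endpoint that fires before the user finished, and a false retention is one that fires more than `slow_ms` after. The `Turn` type and the 1000 ms threshold are illustrative.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    speech_end_ms: int  # when the user actually finished (from labels)
    endpoint_ms: int    # when the endpointer fired


def endpointing_report(turns: list[Turn], slow_ms: int = 1000) -> dict:
    """Summarize false cutoffs and false retentions over labeled turns."""
    if not turns:
        raise ValueError("need at least one labeled turn")
    cutoffs = [t for t in turns if t.endpoint_ms < t.speech_end_ms]
    # Delay is only meaningful for turns that were not cut off.
    delays = [t.endpoint_ms - t.speech_end_ms
              for t in turns if t.endpoint_ms >= t.speech_end_ms]
    retained = [d for d in delays if d > slow_ms]
    return {
        "false_cutoff_rate": len(cutoffs) / len(turns),
        "false_retention_rate": len(retained) / len(turns),
        "mean_delay_ms": mean(delays) if delays else 0.0,
    }
```

Running the same labeled set through this report before and after a window change shows directly whether a latency win was paid for with more cutoffs.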
Silero VAD: the workhorse
Silero VAD has been the de facto open-source VAD for several years for a reason: it's fast, accurate, and runs on CPU. Its key properties in 2026:
Picovoice's Cobra is a commercial alternative pitched at edge use cases. WebRTC VAD remains in some legacy pipelines but is widely considered worse than Silero on noisy audio. Picovoice's VAD comparison covers the trade-offs.
Tuning recipe for production
A practical starting point for tuning VAD + endpointing on a new voice agent:
What goes wrong in noisy environments
VAD performance degrades in:
The mitigations are noise suppression upstream of VAD (RNNoise, Krisp), better hardware (good mics matter), and acoustic echo cancellation in WebRTC pipelines. Don't expect any VAD to compensate for a fundamentally broken audio path. [Inference]
The naturalness ceiling
Even with surgical tuning, fully natural-feeling endpointing requires a perfect predictor of human turn-taking — which is itself an open research problem. The best 2026 systems reach roughly the level of "competent non-native speaker" rather than "native interlocutor." [Speculation] Don't promise users magic.
The bottom line
Tune endpointing first, tune the LLM second. A 500ms reduction in endpointing latency will often improve perceived agent speed more than a model upgrade would. Use Silero VAD, start with a 400ms window, layer in semantic checks, measure both error modes, and revisit as your traffic profile evolves.