Interruption Handling in Voice Agents: The Hard Problem
The single most common reason voice agents feel "robotic" is not voice quality, latency, or even reasoning quality. It is interruption handling. A human conversation partner stops talking the moment you start. A bad voice agent talks over you, ignores you, or restarts in confusion. Interruption is the boundary between "tech demo" and "I'd actually use this."
It is also the hardest engineering problem in the voice stack, because it spans every layer at once.
Why it's hard: it's a cross-stage coordination problem
When a user interrupts an AI agent that is mid-speech, you have to make six things happen, in order, in tens of milliseconds:
Miss any of these and the experience breaks. If TTS keeps playing for 500ms after you start talking, you talk over yourself. If the LLM keeps generating after cancel, you waste tokens. If state is wrong, the agent thinks it said something it didn't, and the next turn is confused.
The cancel-the-LLM pattern
The single most important pattern is fast LLM cancellation. A conventional LLM call is fire-and-forget — you start it, you wait for tokens, you process them. With interruption, you need to abort the in-flight call the moment user speech is detected.
The implementations that work in 2026:
AbortController or equivalent. The voice pipeline holds the abort handle and triggers it on VAD-detected user speech.The mistake most teams make is canceling TTS but letting the LLM keep generating. The user hears silence, the agent thinks it spoke, and the conversation drifts. [Inference] Cancel both, or cancel neither.
Audio buffer flushing
When a user interrupts, you cannot just stop sending new TTS audio. You have to flush whatever has already been sent to the audio output and is sitting in network buffers, codec buffers, or the operating system's audio queue.
In WebRTC pipelines this is usually done by:
LiveKit Agents handles most of this in framework code. Per the LiveKit docs, the SDK includes "adaptive interruption handling" with semantic turn detection. You still need to think about which buffer flush strategy fits your codec and client.
Endpointing: the false-positive problem
A second-order issue: how confident are you that the user is actually interrupting versus making a brief acknowledgement noise like "uh-huh" or coughing?
Naïve VAD-only interruption causes false positives constantly. A back-channel "yeah" from the user becomes a full restart. The fix is two-tier endpointing:
Recent vendors have added semantic turn detection — a small classifier model that decides whether the user's incoming speech is a real intent to take the floor or a back-channel. LiveKit's recent releases include this. [Inference] Pipecat's instruction-following work per their conversation with Coval has called out this trade-off explicitly.
Restart semantics
When the user interrupts, what should the agent do next? Three options:
For most use cases, option 2 is the most natural. The user gets the impression they were heard, not just timed out.
What "good" looks like in 2026
Production voice agents should target:
These numbers are aspirational benchmarks; few production systems publish theirs. [Unverified] Measure your own.
A pragmatic checklist
If you're building voice agents in 2026:
The bottom line
Interruption handling is what makes a voice agent feel alive. It's a coordination problem across VAD, TTS, LLM, and state, and it lives in tens of milliseconds. Get it right and users forget they're talking to software. Get it wrong and they hang up.