Interruption Handling in Voice Agents: The Hard Problem

CallMissedMay 8, 2026

·5 min readGuide

Voice AI Engineering Real-time Conversation Design

The single most common reason voice agents feel "robotic" is not voice quality, latency, or even reasoning quality. It is interruption handling. A human conversation partner stops talking the moment you start. A bad voice agent talks over you, ignores you, or restarts in confusion. Interruption is the boundary between "tech demo" and "I'd actually use this."

It is also the hardest engineering problem in the voice stack, because it spans every layer at once.

Why it's hard: it's a cross-stage coordination problem

When a user interrupts an AI agent that is mid-speech, you have to make six things happen, in order, in tens of milliseconds:

Voice activity detection notices new user audio.

TTS output stops mid-stream and the audio buffer is flushed.

The in-flight LLM generation is canceled.

STT is re-pointed at the new user input.

Conversation state is reconciled — what did the user actually hear before being interrupted?

The next turn begins, with the agent understanding it was cut off.

Miss any of these and the experience breaks. If TTS keeps playing for 500ms after you start talking, you talk over yourself. If the LLM keeps generating after cancel, you waste tokens. If state is wrong, the agent thinks it said something it didn't, and the next turn is confused.

The cancel-the-LLM pattern

The single most important pattern is fast LLM cancellation. A conventional LLM call is fire-and-forget — you start it, you wait for tokens, you process them. With interruption, you need to abort the in-flight call the moment user speech is detected.

The implementations that work in 2026:

Streaming + abort signal. Every major SDK now supports AbortController or equivalent. The voice pipeline holds the abort handle and triggers it on VAD-detected user speech.

Aggressive timeouts. Even without explicit cancel, set a tight upstream timeout (under 100ms grace) once an interruption is detected. Better to cut the model off than let it run.

Rollback the conversation log. Mark the last assistant turn as "interrupted at token N." This is the state the LLM needs on the next turn so it knows what the user heard.

The mistake most teams make is canceling TTS but letting the LLM keep generating. The user hears silence, the agent thinks it spoke, and the conversation drifts. [Inference] Cancel both, or cancel neither.

Audio buffer flushing

When a user interrupts, you cannot just stop sending new TTS audio. You have to flush whatever has already been sent to the audio output and is sitting in network buffers, codec buffers, or the operating system's audio queue.

In WebRTC pipelines this is usually done by:

Sending a control message to the client to clear its audio jitter buffer.

Reducing playback gain rapidly to zero — a hard stop without an audible click.

Marking the next playback packet with a sequence number that signals "discard everything older."

LiveKit Agents handles most of this in framework code. Per the LiveKit docs, the SDK includes "adaptive interruption handling" with semantic turn detection. You still need to think about which buffer flush strategy fits your codec and client.

Endpointing: the false-positive problem

A second-order issue: how confident are you that the user is actually interrupting versus making a brief acknowledgement noise like "uh-huh" or coughing?

Naïve VAD-only interruption causes false positives constantly. A back-channel "yeah" from the user becomes a full restart. The fix is two-tier endpointing:

Soft VAD for back-channels — register the user is engaged but don't interrupt.

Hard endpointing for full barge-in — only cancel when the user has uttered enough to be a real turn.

Recent vendors have added semantic turn detection — a small classifier model that decides whether the user's incoming speech is a real intent to take the floor or a back-channel. LiveKit's recent releases include this. [Inference] Pipecat's instruction-following work per their conversation with Coval has called out this trade-off explicitly.

Restart semantics

When the user interrupts, what should the agent do next? Three options:

Treat the interruption as a new turn from scratch. The agent forgets what it was about to say and responds to whatever the user just said.

Carry context but acknowledge the interruption. The agent says something like "Sure, before I continue — yes?" and threads the user's question into the prior context.

Defer the original answer. The agent answers the user's interruption, then offers to continue with the previous topic.

For most use cases, option 2 is the most natural. The user gets the impression they were heard, not just timed out.

What "good" looks like in 2026

Production voice agents should target:

Interruption-to-silence latency under 200ms — the time from user starting to speak to the agent's audio actually stopping in the user's ear. [Inference]

State-rollback accuracy above 95% — the agent should correctly know how much of its last response the user heard.

Back-channel false-positive rate under 5% — non-interruption noises should rarely trigger restarts.

These numbers are aspirational benchmarks; few production systems publish theirs. [Unverified] Measure your own.

A pragmatic checklist

If you're building voice agents in 2026:

Use a framework with first-class interruption support — LiveKit Agents, Pipecat, or a managed platform.

Wire LLM cancellation to VAD — never let the model keep generating while the user is speaking.

Implement two-tier endpointing — back-channel detection separate from full barge-in.

Maintain conversation state with "interrupted at" markers so the LLM knows what the user heard.

Measure interruption-to-silence latency end-to-end, not just at any single layer.

The bottom line

Interruption handling is what makes a voice agent feel alive. It's a coordination problem across VAD, TTS, LLM, and state, and it lives in tens of milliseconds. Get it right and users forget they're talking to software. Get it wrong and they hang up.

Frequently Asked Questions

Can I add interruption handling to a custom-built voice pipeline?

You can, but it's substantial work — VAD wiring, TTS buffer flushing, LLM abort handling, and state rollback all have to coexist. Most teams in 2026 use LiveKit Agents or Pipecat, which provide these primitives in framework code.

What's a back-channel and why does it matter?

A back-channel is a brief sound the listener makes — "uh-huh," "right," "okay" — that signals engagement without claiming the floor. Naïve VAD treats these as interruptions, causing false restarts. Production agents need to distinguish back-channels from real barge-ins.

Should the agent acknowledge that it was interrupted?

Usually yes. A small acknowledgement — "Of course," or "Sure, what's up?" — reads as natural and signals the agent heard the user, rather than timing out silently.