Prompt Engineering for Voice Agents
Prompt engineering for voice is not prompt engineering for chat with a TTS bolted on. Voice has different constraints — latency budget, no formatting, interruption tolerance, listener attention span — that change every layer of how you write the prompt. The same prompt that produces excellent responses in ChatGPT will make your voice agent feel slow, robotic, and exhausting.
The core constraint: every word costs latency
In a chat UI, users skim. In voice, they wait. A 200-token response is fine on screen. Spoken, the same 200 tokens (about 150 words) take roughly 60 seconds of wall time at typical TTS speeds [Inference], and the user has tuned out by token 80.
The single highest-leverage rule: target 30–80 tokens per turn unless the user explicitly asks for detail. Every prompt should reinforce this:
You are a voice assistant. Keep responses brief — usually one or two sentences.
Only give a longer answer when the user explicitly asks for detail.
Never list more than three items unless asked.
This single instruction does more for perceived agent quality than any other prompt-level change. [Inference]
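The arithmetic behind the 200-token example and the 30–80 token target can be sketched in a few lines. This is a rough estimate only: the words-per-token ratio and the speaking rate below are rule-of-thumb assumptions, not measurements of any particular TTS engine, and the function names are hypothetical.

```python
# Rule-of-thumb constants; both are assumptions, not measured values.
WORDS_PER_TOKEN = 0.75   # rough average for English text
WORDS_PER_MINUTE = 150   # typical conversational speaking rate


def estimate_speech_seconds(token_count: int) -> float:
    """Estimate how long a response of `token_count` tokens takes to speak."""
    words = token_count * WORDS_PER_TOKEN
    return words / WORDS_PER_MINUTE * 60


def within_turn_budget(token_count: int, max_tokens: int = 80) -> bool:
    """Return True if the response fits the per-turn token target."""
    return token_count <= max_tokens


if __name__ == "__main__":
    for tokens in (30, 80, 200):
        secs = estimate_speech_seconds(tokens)
        ok = within_turn_budget(tokens)
        print(f"{tokens} tokens ~ {secs:.0f}s spoken, in budget: {ok}")
```

A check like this is most useful in logging or evaluation, where it can flag turns that ran long; the prompt instruction remains the primary control.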
No formatting, no markdown, no lists
Voice agents read out their text. Markdown bullets become "asterisk space," numbered lists become "one period two period three period," and headings become weird vocal pauses.
Bake this into the system prompt:
Respond in plain prose only. Never use markdown, bullet points, headings,
or numbered lists. Spell out numbers under twenty.
If your TTS can pronounce numerals well, drop the number rule. Most can; some can't.
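Models occasionally emit markdown despite the instruction, so a small post-processing step before TTS can act as a safety net. The sketch below is one possible approach, not a complete markdown parser: it handles only headings, list markers, and emphasis marks, and the function names are hypothetical.

```python
import re

_SMALL_NUMBERS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]


def strip_markdown(text: str) -> str:
    """Remove common markdown markers so the TTS engine never reads them aloud."""
    # Heading markers at the start of a line, e.g. "## Title".
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Bullet or numbered list markers, e.g. "- item" or "1. item".
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+[.)])[ \t]+", "", text, flags=re.MULTILINE)
    # Bold, italic, and inline code marks.
    text = re.sub(r"\*\*|__|[*`]", "", text)
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def spell_small_numbers(text: str) -> str:
    """Replace standalone integers below twenty with their spelled-out form."""
    def replace(match):
        value = int(match.group())
        return _SMALL_NUMBERS[value] if value < 20 else match.group()

    # Skip decimals ("3.5"), codes with leading zeros ("007"), and digits inside words.
    return re.sub(r"(?<![\d.,])\b(?:0|[1-9]\d*)\b(?![.,]?\d)", replace, text)


def prepare_for_tts(text: str) -> str:
    """Strip markdown first, then spell out small numbers."""
    return spell_small_numbers(strip_markdown(text))
```

Stripping runs before number spelling so that a list marker such as "1." is removed rather than read as "one."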
Persona: tighter than chat, looser than IVR
A good voice persona sits between two failure modes: the loose, chatty style of a chat assistant and the rigid, scripted style of an IVR menu.
The pattern that works in 2026 is role + tone + register:
You are [role: a customer support agent for a fintech company].
You sound [tone: warm, professional, brisk].
Your register is [register: friendly but not casual; first names yes, slang no].
Add one or two illustrative phrases ("opens with 'Hey there'," "uses 'absolutely' and 'sure thing' as confirmations") and the model will stay in character without exhaustive instruction.
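If you maintain several agents, the role + tone + register pattern can be filled in from structured data rather than hand-edited. The following is a minimal sketch; the `Persona` class, its field names, and the "Typical phrases" wording are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    role: str
    tone: str
    register: str
    # Optional illustrative phrases that anchor the character.
    example_phrases: list[str] = field(default_factory=list)


def persona_prompt(p: Persona) -> str:
    """Render the role + tone + register pattern as prompt text."""
    lines = [
        f"You are {p.role}.",
        f"You sound {p.tone}.",
        f"Your register is {p.register}.",
    ]
    if p.example_phrases:
        quoted = ", ".join(f'"{phrase}"' for phrase in p.example_phrases)
        lines.append(f"Typical phrases you use: {quoted}.")
    return "\n".join(lines)


if __name__ == "__main__":
    support = Persona(
        role="a customer support agent for a fintech company",
        tone="warm, professional, brisk",
        register="friendly but not casual; first names yes, slang no",
        example_phrases=["absolutely", "sure thing"],
    )
    print(persona_prompt(support))
```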
Interruption-friendly responses
Users interrupt voice agents constantly. Your prompt should produce responses that are interruption-tolerant:
Example prompt addition:
Lead with the answer in the first sentence. Save context and qualifications
for follow-up sentences. End your turn with a clear handoff (usually a
question or "anything else?") so the user knows it's their turn.
Repair behavior: the secret weapon
Voice STT is imperfect. Users hear the agent misunderstand and try to correct. Most generic prompts handle correction badly — they argue, repeat, or ask the user to start over.
Bake repair behavior into the prompt:
If the user corrects you or says you misheard, immediately accept the correction
without arguing. Apologize briefly ("Sorry — got it") and proceed with the
corrected information. Do not ask them to repeat unless you genuinely missed it.
This single addition reduces user frustration on noisy calls dramatically. [Inference]
Latency-aware prompting
The longer the prompt and conversation history, the longer the LLM time-to-first-token. In voice, that delay is audible. Three patterns:
Tool use in voice
When voice agents call tools (look up account, schedule appointment, send SMS), they pause for the tool result. The prompt has to handle that natural latency:
When you need to look something up, briefly acknowledge it ("One moment, let me check")
before the tool call. Don't go silent. Once the result arrives, deliver the answer
in one sentence.
The acknowledgement bridges the gap between user request and tool result, which can be hundreds of milliseconds to seconds.
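On the orchestration side, the same idea can be sketched with `asyncio`. Everything here is a stand-in: `speak` records text instead of calling a TTS engine, and `lookup_balance` is a hypothetical tool whose sleep simulates network latency. As a variation on the prompt wording, this sketch starts the lookup concurrently so that the acknowledgement plays while the tool is already running.

```python
import asyncio


async def speak(text: str, spoken: list[str]) -> None:
    """Stand-in for a TTS call; records the text instead of playing audio."""
    spoken.append(text)


async def lookup_balance(account_id: str) -> str:
    """Hypothetical slow tool; the sleep simulates network latency."""
    await asyncio.sleep(0.05)
    return "$42.10"


async def handle_balance_request(account_id: str, spoken: list[str]) -> None:
    # Start the tool call first so it runs while the acknowledgement plays.
    tool_task = asyncio.create_task(lookup_balance(account_id))
    await speak("One moment, let me check.", spoken)
    balance = await tool_task
    # Deliver the result in a single sentence.
    await speak(f"Your balance is {balance}.", spoken)


if __name__ == "__main__":
    log: list[str] = []
    asyncio.run(handle_balance_request("acct-123", log))
    print(log)
```

Whether the acknowledgement comes from the model or from the orchestration layer is a design choice; the point is that the user hears something before the wait begins.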
Names, numbers, and proper nouns
Voice agents say things back to users — order numbers, addresses, account balances, names. Pronunciation matters and the prompt can help:
Example:
When repeating order numbers, account numbers, or codes back to the user,
spell them out clearly with each character. When confirming names,
ask the user to spell uncommon names.
What a complete voice system prompt looks like
A condensed example combining the above:
You are an AI voice assistant for [company].
VOICE STYLE:
- Speak in plain prose. No markdown, lists, or headings.
- Keep responses to 1–2 sentences unless the user asks for detail.
- Lead with the answer; expand only if asked.
- End with a brief handoff so the user knows it's their turn.
PERSONA:
- Warm, professional, brisk.
- First names, no slang. Confirm with "absolutely" or "got it."
REPAIR:
- If the user corrects you, accept and proceed. Do not argue.
TOOLS:
- Acknowledge before tool calls ("One moment").
- Deliver results in one sentence.
NUMBERS AND NAMES:
- Spell out reference codes character by character.
- Confirm uncommon names by spelling.
This is roughly 150 tokens. Adjust to your domain.
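To keep the prompt easy to adjust per domain, the sections can be stored as data and assembled at startup. The sketch below assumes a hypothetical company name and uses a crude four-characters-per-token estimate, which is only an approximation; use your model's tokenizer for a real count.

```python
SECTIONS = {
    "VOICE STYLE": [
        "Speak in plain prose. No markdown, lists, or headings.",
        "Keep responses to 1–2 sentences unless the user asks for detail.",
        "Lead with the answer; expand only if asked.",
        "End with a brief handoff so the user knows it's their turn.",
    ],
    "PERSONA": [
        "Warm, professional, brisk.",
        'First names, no slang. Confirm with "absolutely" or "got it."',
    ],
    "REPAIR": [
        "If the user corrects you, accept and proceed. Do not argue.",
    ],
    "TOOLS": [
        'Acknowledge before tool calls ("One moment").',
        "Deliver results in one sentence.",
    ],
    "NUMBERS AND NAMES": [
        "Spell out reference codes character by character.",
        "Confirm uncommon names by spelling.",
    ],
}


def build_system_prompt(company: str, sections: dict[str, list[str]]) -> str:
    """Assemble the system prompt from an intro line and titled rule sections."""
    lines = [f"You are an AI voice assistant for {company}."]
    for title, rules in sections.items():
        lines.append(f"{title}:")
        lines.extend(f"- {rule}" for rule in rules)
    return "\n".join(lines)


def rough_token_count(text: str) -> int:
    """Very rough estimate: about four characters per token for English."""
    return len(text) // 4


if __name__ == "__main__":
    prompt = build_system_prompt("Example Co", SECTIONS)  # hypothetical company
    print(prompt)
    print("approx tokens:", rough_token_count(prompt))
```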
Testing voice prompts
Don't test voice prompts in chat. Test them in voice. The same response that reads fine on screen can sound robotic, stilted, or rushed when spoken. Build a voice-prompt testbed where you:
Iterate on prompts based on listener experience, not transcript reading.
The bottom line
Voice prompts are shorter, blunter, and more behavior-driven than chat prompts. Brevity is the single most important constraint; persona, repair behavior, and interruption-friendliness follow closely. Test in voice, not chat. The prompt that wins on screen often loses on the phone.