Prompt Engineering for Voice Agents

CallMissed
·6 min readGuide

Prompt engineering for voice is not prompt engineering for chat with a TTS bolted on. Voice has different constraints — latency budget, no formatting, interruption tolerance, listener attention span — that change every layer of how you write the prompt. The same prompt that produces excellent responses in ChatGPT will make your voice agent feel slow, robotic, and exhausting.

The core constraint: every word costs latency

In a chat UI, users skim. In voice, they wait. A 200-token response is fine on screen. A 200-token spoken response is 60 seconds of wall time at typical TTS speeds, [Inference], and the user has tuned out by token 80.

The single highest-leverage rule: target 30–80 tokens per turn unless the user explicitly asks for detail. Every prompt should reinforce this:

Code
You are a voice assistant. Keep responses brief — usually one or two sentences.
Only give a longer answer when the user explicitly asks for detail.
Never list more than three items unless asked.

This single instruction does more for perceived agent quality than any other prompt-level change. [Inference]

No formatting, no markdown, no lists

Voice agents read out their text. Markdown bullets become "asterisk space," numbered lists become "one period two period three period," and headings become weird vocal pauses.

Bake this into the system prompt:

Code
Respond in plain prose only. Never use markdown, bullet points, headings,
or numbered lists. Spell out numbers under twenty.

If your TTS can pronounce numerals well, drop the number rule. Most can; some can't.

Persona: tighter than chat, looser than IVR

A good voice persona sits between two failure modes:

  • Too neutral — sounds like a generic IVR robot, low engagement.
  • Too distinctive — sounds like a comedy character, eats clarity for personality.
  • The pattern that works in 2026 is role + tone + register:

    Code
    You are [role: a customer support agent for a fintech company].
    You sound [tone: warm, professional, brisk].
    Your register is [register: friendly but not casual — first names yes, slang no].

    Add one or two illustrative phrases ("opens with 'Hey there'," "uses 'absolutely' and 'sure thing' as confirmations") and the model will stay in character without exhaustive instruction.

    Interruption-friendly responses

    Users interrupt voice agents constantly. Your prompt should produce responses that are interruption-tolerant:

  • Front-load the answer. Put the substantive content in the first sentence; treat anything after it as expansion the user can skip.
  • Avoid windups. "Great question!" or "Let me check that for you" before the actual answer is friction in voice.
  • End on a question or natural pause. Gives the user a clear handoff back without feeling cut off.
  • Example prompt addition:

    Code
    Lead with the answer in the first sentence. Save context and qualifications
    for follow-up sentences. End your turn with a clear handoff — usually a
    question or "anything else?" — so the user knows it's their turn.

    Repair behavior: the secret weapon

    Voice STT is imperfect. Users hear the agent misunderstand and try to correct. Most generic prompts handle correction badly — they argue, repeat, or ask the user to start over.

    Bake repair behavior into the prompt:

    Code
    If the user corrects you or says you misheard, immediately accept the correction
    without arguing. Apologize briefly ("Sorry — got it") and proceed with the
    corrected information. Do not ask them to repeat unless you genuinely missed it.

    This single addition reduces user frustration on noisy calls dramatically. [Inference]

    Latency-aware prompting

    The longer the prompt and conversation history, the longer the LLM time-to-first-token. In voice, that delay is audible. Three patterns:

  • Trim system prompts to what's actually load-bearing. A 2000-token system prompt that re-explains conversation skills is paying TTFT cost on every turn.
  • Summarize long conversations. After 8–10 turns, summarize older context into a compressed memory and drop the verbatim history. Less prompt, faster responses.
  • Use the smallest model that meets your accuracy bar. A 4B parameter LLM that responds in 200ms can feel better than a frontier model that responds in 800ms. The voice user experience is a function of latency more than peak intelligence. [Inference]
  • Tool use in voice

    When voice agents call tools (look up account, schedule appointment, send SMS), they pause for the tool result. The prompt has to handle that natural latency:

    Code
    When you need to look something up, briefly acknowledge it ("One moment, let me check")
    before the tool call. Don't go silent. Once the result arrives, deliver the answer
    in one sentence.

    The acknowledgement bridges the gap between user request and tool result, which can be hundreds of milliseconds to seconds.

    Names, numbers, and proper nouns

    Voice agents say things back to users — order numbers, addresses, account balances, names. Pronunciation matters and the prompt can help:

  • Spell out account numbers and reference codes ("That's A-B-C-1-2-3-4") for clarity.
  • Pronounce common acronyms naturally ("USA" as a word vs U-S-A spelled out, depending on convention).
  • Confirm names back to the user before using them as keys to a database lookup.
  • Example:

    Code
    When repeating order numbers, account numbers, or codes back to the user,
    spell them out clearly with each character. When confirming names,
    ask the user to spell uncommon names.

    What a complete voice system prompt looks like

    A condensed example combining the above:

    Code
    You are an AI voice assistant for [company].
    
    VOICE STYLE:
    - Speak in plain prose. No markdown, lists, or headings.
    - Keep responses to 1–2 sentences unless the user asks for detail.
    - Lead with the answer; expand only if asked.
    - End with a brief handoff so the user knows it's their turn.
    
    PERSONA:
    - Warm, professional, brisk.
    - First names, no slang. Confirm with "absolutely" or "got it."
    
    REPAIR:
    - If the user corrects you, accept and proceed. Do not argue.
    
    TOOLS:
    - Acknowledge before tool calls ("One moment").
    - Deliver results in one sentence.
    
    NUMBERS AND NAMES:
    - Spell out reference codes character by character.
    - Confirm uncommon names by spelling.

    This is roughly 150 tokens. Adjust to your domain.

    Testing voice prompts

    Don't test voice prompts in chat. Test them in voice. The same response that reads fine on screen can sound robotic, stilted, or rushed when spoken. Build a voice-prompt testbed where you:

  • Hit the LLM with realistic conversation history.
  • Pipe the response through your actual TTS.
  • Listen.
  • Iterate on prompts based on listener experience, not transcript reading.

    The bottom line

    Voice prompts are shorter, blunter, and more behavior-driven than chat prompts. Brevity is the single most important constraint; persona, repair behavior, and interruption-friendliness follow closely. Test in voice, not chat. The prompt that wins on screen often loses on the phone.

    Frequently Asked Questions

    How long should a voice agent's response be?
    Aim for 30–80 tokens per turn — usually one or two sentences. Longer only when the user explicitly asks for detail. Most users tune out within 5 seconds of monologue.
    Should I tell the model "you are talking to a user on a phone"?
    Yes. That single instruction shifts the model toward shorter, simpler, more conversational language without explicit length rules. Pair it with persona and brevity instructions.
    How do I get the model to handle STT errors gracefully?
    Prompt for repair behavior — accept corrections, apologize briefly, proceed. Don't have the model ask the user to repeat unless it genuinely missed content. Most STT errors are minor and can be handled by inference rather than re-asking.

    Related Posts