Emotion-Aware TTS: From Tone to Empathy

CallMissed

For most of the history of text-to-speech (TTS), the goal was clarity: the model said the words and you understood them. By 2024 that bar was met across major languages. By 2026 the frontier has moved to TTS that does not just say the words but conveys how they should feel. Emotion-aware TTS is the next layer of voice naturalness, and it is also the hardest one to evaluate.

What emotion-aware TTS actually means

Three increasingly ambitious definitions:

  • Tone control. The user can ask for "warm," "professional," or "urgent" delivery and the model adjusts prosody.
  • Context-driven emotion. The model reads the text and decides on emotional delivery without explicit instruction.
  • Empathic response. The model adjusts to the listener — when the user sounds upset, the agent's voice softens.

Each step up requires more model capability. Tone control is widely available. Context-driven emotion is shipping. Empathic response is the leading edge.

OpenAI's instructable voices

OpenAI's gpt-4o-mini-tts and gpt-realtime models added explicit instructability in 2025–2026. Per OpenAI's announcement, developers can instruct the model on how to speak (examples include "speak quickly and professionally" or "speak empathetically in a French accent") alongside what to say.

The Realtime API's gpt-realtime model added two voices, Cedar and Marin, designed for instructable voice in production. Per the Realtime API blog, the model is trained to follow fine-grained delivery instructions.

This is tone control as a first-class API surface. It works well for predictable use cases (customer service warmth, urgent alerts, calm navigation) and degrades on more nuanced emotional ranges.
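As a rough illustration of "how to speak" living next to "what to say," here is a minimal sketch that builds the JSON body for a speech request. It assumes the request takes `model`, `voice`, `input`, and `instructions` fields and that `coral` is an available voice; both are assumptions to check against the current API reference. The helper name `build_speech_request` is hypothetical, and the sketch deliberately stops short of sending anything over the network.

```python
import json

# Assumed endpoint for OpenAI's speech-generation API; verify before use.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, delivery: str, voice: str = "coral") -> dict:
    """Build a request body that separates content from delivery.

    `input` carries what to say; `instructions` carries how to say it.
    The voice name is an assumption, not a recommendation.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": delivery,
    }


body = build_speech_request(
    "Your replacement card ships tomorrow.",
    "Speak warmly and calmly, at an unhurried pace.",
)
print(json.dumps(body, indent=2))
```

Keeping the delivery prompt as its own argument makes it easy to run the same sentence through several instructions and compare the results by ear.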

Hume's empathic voice interface

Hume AI takes a more ambitious approach. Per Hume's EVI documentation, EVI 3 integrates transcription, reasoning, and voice synthesis into a single empathic voice model.

The claims:

  • Detects emotion in user audio — sarcasm, frustration, excitement.
  • Adjusts agent voice in response — speaking softly to a stressed user, energetically to an excited one.
  • Replicates speaking styles, accents, and emotional tones in real time.
  • Near-instant responses (~300ms), per Hume's coverage.

EVI 1 and EVI 2 were retired in August 2025; EVI 3 is the current generation. The product is well-suited for emotion-heavy use cases such as coaching, therapy support, and customer recovery flows.
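The detect-then-adapt idea in the claims above can be pictured with a small sketch. This is not Hume's API: EVI is described as a single model, and nothing here reflects how it works internally. The emotion labels, the `Delivery` type, the 0.6 threshold, and the style table are all hypothetical, chosen only to show the shape of "detected emotion in, delivery style out."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delivery:
    pace: str
    energy: str
    instruction: str


# Hypothetical style table; a real system would tune or learn this.
STYLE_BY_EMOTION = {
    "frustration": Delivery("slow", "low", "Speak softly and patiently."),
    "excitement": Delivery("brisk", "high", "Speak energetically."),
}
NEUTRAL = Delivery("medium", "medium", "Speak in a clear, neutral tone.")


def choose_delivery(emotion_scores: dict[str, float], threshold: float = 0.6) -> Delivery:
    """Pick a delivery style from the strongest detected emotion.

    Falls back to neutral when no emotion clears the (assumed) threshold
    or when the strongest label has no entry in the style table.
    """
    if not emotion_scores:
        return NEUTRAL
    label, score = max(emotion_scores.items(), key=lambda kv: kv[1])
    if score < threshold:
        return NEUTRAL
    return STYLE_BY_EMOTION.get(label, NEUTRAL)


print(choose_delivery({"frustration": 0.82, "excitement": 0.05}).instruction)
```

The neutral fallback matters as much as the mapping: adapting on a weak or ambiguous signal is one way a half-implemented empathic loop goes wrong.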

Cartesia and ElevenLabs

The other major TTS vendors have each added emotion-adjacent features:

  • Cartesia Sonic-3 ships with "AI laughter and emotion," per Cartesia, with time-to-first-audio around 90ms (40ms on Turbo variants).
  • ElevenLabs Multilingual v2 and v3 prioritize quality and expressiveness for longer-form content, while Flash v2.5 prioritizes latency. Per ElevenLabs, Flash v2.5 hits roughly 75ms inference latency.

The vendor positioning in 2026: ElevenLabs for nuanced delivery in long-form content, Cartesia for fast emotional TTS in real time, Hume for full empathic loops, and OpenAI for instructable, mid-quality real-time voice. Sarvam Bulbul covers the Indian-language equivalent. [Inference]

When emotion matters and when it doesn't

A useful framing: emotion in TTS adds value where the listener's emotional state is the product. It adds little where the content is purely informational.

High emotional value:

  • Customer recovery calls (frustrated users).
  • Healthcare and mental wellness.
  • Bereavement, customer success retention.
  • Coaching and education.
  • Sales when relationship-building is the goal.

Low emotional value:

  • Order status, balance lookups, scheduling.
  • Read-aloud notifications.
  • Internal IVR dispatch.
  • Quick informational responses.

Spending on Hume EVI 3 for an order-status agent is over-investment. Using the cheapest neutral TTS for a coaching agent is under-investment.
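One way to make the two columns above operational is a small routing table that an agent platform consults when picking a voice tier. The sketch below is only that: the use-case keys, the tier names, and the `recommend_tts_tier` helper are hypothetical labels invented for illustration, not any vendor's configuration.

```python
# Hypothetical use-case keys mirroring the two lists above.
HIGH_EMOTIONAL_VALUE = {
    "customer_recovery",
    "healthcare",
    "mental_wellness",
    "bereavement",
    "retention",
    "coaching",
    "education",
    "relationship_sales",
}
LOW_EMOTIONAL_VALUE = {
    "order_status",
    "balance_lookup",
    "scheduling",
    "notification",
    "ivr_dispatch",
    "quick_info",
}


def recommend_tts_tier(use_case: str) -> str:
    """Return an assumed tier label for a use case.

    Unknown use cases are flagged for a human decision rather than
    silently defaulting to either tier.
    """
    if use_case in HIGH_EMOTIONAL_VALUE:
        return "expressive"
    if use_case in LOW_EMOTIONAL_VALUE:
        return "neutral"
    return "needs_review"


for case in ("order_status", "coaching", "loyalty_survey"):
    print(case, "->", recommend_tts_tier(case))
```

Routing unknown cases to review keeps the default honest: the cost of guessing wrong runs in both directions, as the over-investment and under-investment examples show.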

Evaluation difficulty

Emotion in TTS is hard to evaluate for one stubborn reason: humans disagree on what the right emotion was.

Three evaluation approaches in use:

  • Human listening tests. Rate samples on dimensions like warmth, urgency, naturalness. Robust but slow and subjective.
  • Comparative ranking. Pairwise comparison of two TTS samples — which feels more empathetic. Less subjective per pair but doesn't yield absolute scores.
  • Automated emotion classifiers. Models that score audio on emotion dimensions. [Inference] Useful but biased toward extreme samples.

Cross-cultural variation matters too. "Warm" delivery in US English is different from "warm" in Japanese; what reads as professional in one culture reads as cold in another. [Speculation]
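The comparative-ranking approach can be tallied with very little code. The sketch below assumes each judgment is recorded as a `(winner, loser)` pair of system names and reduces them to a per-system win rate. The system names and judgments are made up for illustration, and a plain win rate is a deliberately simple stand-in for more careful pairwise models.

```python
from collections import defaultdict


def win_rates(judgments):
    """Compute each system's share of pairwise comparisons won.

    `judgments` is an iterable of (winner, loser) name pairs.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in judgments:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {system: wins[system] / total[system] for system in total}


# Illustrative judgments only: "which sample feels more empathetic?"
judgments = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("C", "B")]
ranking = sorted(win_rates(judgments).items(), key=lambda kv: kv[1], reverse=True)
for system, rate in ranking:
    print(f"{system}: {rate:.2f}")  # A: 0.75, C: 0.50, B: 0.25
```

As the text notes, this yields an ordering rather than an absolute score, and systems compared fewer times (C here) rest on thinner evidence.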

The disclosure question

Emotion-aware TTS raises a transparency question that neutral TTS mostly avoids. When an agent's voice softens because the user sounds upset, is that:

  • Helpful adaptation the user appreciates?
  • Manipulation the user would object to if disclosed?

In 2026 the EU AI Act's disclosure requirements (per Article 50) apply to AI-generated audio generally. Specific rules around emotional manipulation in voice are still emerging. [Inference] A safe default is to be transparent that the voice is AI and that delivery may adapt to context.

Practical advice for builders

Five concrete patterns:

  • Default to neutral TTS unless your use case lives in the "high emotional value" column.
  • Test instructable voices on your specific delivery prompts before committing — vendor demos are best-case.
  • Use Hume EVI 3 for full empathic loops, not just for emotion in output. Half-implementing empathy is worse than not implementing it.
  • Measure what matters. If your goal is customer recovery, measure recovery rates after deploying emotional TTS — not whether testers think it sounds nicer.
  • Be transparent. Voice users are increasingly aware of AI; covert emotional adaptation backfires when discovered.

Where this is heading

Two probable trajectories for 2026–2027:

  • Wider availability of instructability. Most major TTS vendors will ship some form of delivery instruction by end of 2026. [Inference]
  • Empathic loops as commodity. EVI-style two-way emotional adaptation will move from premium feature to expected feature in customer-facing voice agents over the next 18 months. [Speculation]

The frontier moves from "does the voice sound human" to "does the voice respond like a human," which is a much harder bar.

The bottom line

Emotion-aware TTS in 2026 spans tone control (widely available), context-driven emotion (shipping), and empathic response (leading edge). The right pick depends on whether emotional delivery is the product or just a feature. Evaluate carefully, deploy where the listener's emotional state is the product, and be transparent about what the voice is doing. The technology has crossed a real threshold; using it well is now the design problem.

Frequently Asked Questions

Is Hume EVI 3 worth the integration over instructable OpenAI TTS?
For genuinely empathic two-way conversations (coaching, therapy support, customer recovery), Hume's emotion detection plus expressive synthesis is the more complete loop. For one-way tone control in routine support, OpenAI's instructable voices are simpler and cheaper to integrate.

How do I A/B test emotional TTS?
Compare matched user cohorts and measure outcomes that matter: task success, retention, post-call sentiment, repeat-call rate. Don't rely on internal listening tests; they're biased toward novelty. The user's behavior is the eval.
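
For a binary outcome such as "issue recovered on this call," one common way to compare two cohorts is a two-proportion z-test. The sketch below is one option under stated assumptions, not a prescribed method: it assumes independent users, reasonably large cohorts, and a single pre-chosen metric. The function name and the counts in the example are made up for illustration.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Compare success rates of cohort A (control) and cohort B (treatment).

    Returns (rate difference B minus A, z statistic, two-sided p-value)
    using the pooled-variance normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("cohort sizes must be positive")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return p_b - p_a, z, p_value


# Illustrative counts: neutral TTS (A) vs emotional TTS (B).
diff, z, p = two_proportion_z_test(120, 400, 150, 400)
print(f"lift={diff:.3f} z={z:.2f} p={p:.3f}")
```

Fixing the metric and cohort sizes before the test starts guards against the novelty bias mentioned above creeping back in through metric shopping.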

Are there latency penalties for emotion-aware TTS?
Some. Hume EVI's ~300ms response window is competitive but not the fastest. Instructable TTS adds minimal latency over base TTS in most vendors. Real-time emotional adaptation that listens first and synthesizes second necessarily adds round-trip time compared to one-way TTS.
