TTS Showdown 2026: ElevenLabs vs. Cartesia vs. OpenAI vs. Sesame

CallMissed

Text-to-speech got good somewhere in late 2024. By 2026, "good enough to fool a casual listener" is table stakes for every major vendor. The interesting differences now are at the edges: latency under 100ms, instructable emotion, self-hostability, and the long tail of accents and languages. Here is how the four leaders compare.

ElevenLabs: the quality benchmark

ElevenLabs remains the model the others are measured against. Voices are nearly indistinguishable from human speakers, the emotional range is wide, and voice cloning quality is best-in-class. The Flash v2.5 endpoint returns first audio in ~75ms, which is competitive even on the latency axis ElevenLabs was historically weakest on.

Strengths: Voice quality, prosody, multilingual coverage, voice cloning ecosystem.

Weaknesses: Per-character pricing scales poorly at high volume. Some self-hosted alternatives match it on quality for narrow use cases.

Pick ElevenLabs when: premium support, sales calls, high-conversion-value voice surfaces. Anywhere the voice itself sells the product.

Cartesia: the latency winner

Cartesia Sonic-2 returns first audio in ~95ms, and its architecture is purpose-built for streaming. In blinded human tests cited by the company, Sonic-2 was preferred over ElevenLabs Flash v2 by 61.4% to 38.6%. That is a striking number: quality has caught up while latency stayed ahead.

Strengths: Lowest latency in the category, increasingly competitive quality, predictable streaming behavior.

Weaknesses: Voice library is narrower than ElevenLabs. Less established for cloning use cases.

Pick Cartesia when: real-time voice agents where TTFA is the conversion metric. Customer support agents, scheduling bots, anything where 100ms saved per turn translates directly to user satisfaction.
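If TTFA (time to first audio) is the metric you are buying on, it is worth measuring it yourself rather than relying on vendor figures. The sketch below times any iterator of audio chunks. `fake_stream` is a hypothetical stand-in for a vendor's streaming response, not a real SDK call; in practice you would pass a lazy iterator so that sending the request falls inside the timed region.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a stream of audio chunks and time it.

    Returns (ttfa_seconds, total_seconds, total_bytes). TTFA is measured
    from the start of iteration to the first non-empty chunk.
    """
    start = time.perf_counter()
    ttfa = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and ttfa is None:
            ttfa = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if ttfa is None:
        raise ValueError("stream ended without producing any audio")
    return ttfa, total, total_bytes


def fake_stream(first_delay: float, n_chunks: int, gap: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor's streaming response."""
    time.sleep(first_delay)  # simulated network + model startup
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder audio bytes
        time.sleep(gap)


if __name__ == "__main__":
    ttfa, total, size = measure_ttfa(fake_stream(0.05, 5, 0.01))
    print(f"TTFA {ttfa * 1000:.0f} ms, total {total * 1000:.0f} ms, {size} bytes")
```

Run it from the region your users are in and look at the distribution over many calls, not a single sample.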

OpenAI Voice / GPT-4o Audio: the instructable choice

OpenAI's TTS endpoints (and the GPT-4o audio modality) lead on instructable voice character. You can prompt the model with tone descriptions — "warm and patient," "clipped and professional," "excited" — and it adjusts. This is a different axis from raw quality or latency. It matters when one product surface needs to express different emotions in different moments.

Strengths: Instructable tone, integrated with the multimodal LLM stack, broad language coverage.

Weaknesses: Latency lags Cartesia and ElevenLabs Flash. Pricing follows OpenAI's usage-based API billing, so cost scales with volume.

Pick OpenAI Voice when: narrative content, characters in games, audio surfaces that need to adapt tone within a single session.
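One way to use instructable tone is to keep the tone descriptions in a table keyed by the moment in the product, and build the request from that. This is a minimal sketch: the moment names are hypothetical, and the request keys (`model`, `voice`, `input`, `instructions`) and the `gpt-4o-mini-tts` model name are assumptions modeled on OpenAI's speech endpoint that should be checked against the current API reference.

```python
from typing import Dict

# Hypothetical mapping from product "moments" to tone descriptions.
TONE_BY_MOMENT: Dict[str, str] = {
    "greeting": "warm and patient",
    "billing_dispute": "clipped and professional",
    "upsell": "excited",
}

DEFAULT_TONE = "neutral and clear"


def build_speech_request(text: str, moment: str, voice: str = "alloy") -> Dict[str, str]:
    """Build keyword arguments for an instructable TTS call.

    The key names and the model name are assumptions, not a verified
    contract; confirm them against the vendor's documentation.
    """
    tone = TONE_BY_MOMENT.get(moment, DEFAULT_TONE)
    return {
        "model": "gpt-4o-mini-tts",  # assumed model name; keep it in config
        "voice": voice,
        "input": text,
        "instructions": f"Tone: {tone}.",
    }


if __name__ == "__main__":
    print(build_speech_request("Thanks for calling. How can I help?", "greeting"))
```

Keeping the tone table separate from the call site means writers can adjust wording without touching integration code.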

Sesame: the open-source dark horse

Sesame's Conversational Speech Model (CSM, ~1B parameters, Llama-derived) launched in February 2025 and went open-source in March 2025. The public demos (Maya and Miles) were notable for reproducing pauses, interruptions, and emphasis based on full conversation history, narrowing the "uncanny valley" that plagued earlier TTS models. As of 2026, community benchmarks suggest it remains the only fully self-hostable model that approaches commercial quality.

Strengths: Self-hostable, conversation-aware prosody, no per-character billing.

Weaknesses: Quality is still behind ElevenLabs on premium voices. Operations burden of running it yourself.

Pick Sesame when: data residency rules require self-hosting, volume is high enough that per-character pricing dominates, or you have a strong ML ops team.
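The "volume is high enough" condition is a break-even calculation: self-hosting wins once the fixed monthly cost is smaller than what you save per character. The sketch below shows the arithmetic. Every number in the example is a placeholder, not a real vendor price or a real infrastructure quote.

```python
def breakeven_chars_per_month(price_per_million_chars: float,
                              selfhost_fixed_monthly: float,
                              selfhost_cost_per_million_chars: float = 0.0) -> float:
    """Monthly character volume above which self-hosting is cheaper.

    Solves: price * v = fixed + marginal * v  for v (in characters).
    """
    margin = price_per_million_chars - selfhost_cost_per_million_chars
    if margin <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    return selfhost_fixed_monthly / margin * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 50 per million characters from a vendor,
    # 2000 per month fixed (GPUs, on-call), 10 per million marginal.
    volume = breakeven_chars_per_month(50.0, 2000.0, 10.0)
    print(f"break-even at {volume:,.0f} characters per month")
```

Remember to put engineering time into the fixed cost; it is usually the largest term.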

Hume: the emotion specialist

Worth a mention even though it is not in the four-way headline. Hume leads on emotion expression specifically: its EVI (Empathic Voice Interface) model adapts tone in response to the user's emotional state, not just prompt instructions. For mental-health, coaching, or emotionally loaded use cases, it is worth evaluating.

How to actually choose

Three questions, in order:

  • Is the voice itself the product? (Audiobooks, premium support, characters.) → ElevenLabs.

  • Is sub-100ms first-audio non-negotiable? (Voice agents, live translation.) → Cartesia.

  • Do you need to switch tone within a session, or run the model yourself? → OpenAI Voice (tone) or Sesame (self-host).

If you are still on the fence, the right move in 2026 is to A/B test two voices on real users. Quality preferences split surprisingly evenly: about a third of testers prefer the "warmer" model, a third prefer the "clearer" one, and a third cannot tell. Pick the voice that converts, not the voice that benchmarks.
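A voice A/B test needs two pieces: a stable assignment so each user always hears the same voice, and a per-variant conversion count. Here is a minimal sketch of both, assuming you already log a (variant, converted) pair per session. The variant names and the salt are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def assign_voice(user_id: str, variants: Sequence[str], salt: str = "tts-ab-1") -> str:
    """Deterministically map a user to one variant via a stable hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def conversion_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Turn (variant, converted) pairs into a conversion rate per variant."""
    shown: Counter = Counter()
    converted: Counter = Counter()
    for variant, did_convert in events:
        shown[variant] += 1
        if did_convert:
            converted[variant] += 1
    return {variant: converted[variant] / shown[variant] for variant in shown}


if __name__ == "__main__":
    variants = ["voice_warm", "voice_clear"]  # hypothetical variant names
    print(assign_voice("user-123", variants))
    print(conversion_rates([("voice_warm", True), ("voice_warm", False),
                            ("voice_clear", True)]))
```

Changing the salt reshuffles users, which is useful when you start a new experiment. Raw rates are only a starting point; check sample size and significance before declaring a winner.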

What is changing in 2026

Three trends to watch:

  • Streaming is becoming default. Non-streaming TTS endpoints feel sluggish; vendors are deprecating the batch-only paths.
  • Voice cloning regulation is tightening. Several jurisdictions now require consent records for cloned voices. Vendors with strong consent workflows have an emerging compliance moat.
  • On-device TTS is appearing. Apple's neural TTS and small open models are good enough for narrow use cases without a server hop. Most production voice will stay cloud-side, but the local path is real for the first time.

The model you pick today is not the model you will use in 18 months. Treat the integration like database driver code: well-encapsulated, swappable, with the model name as a config value. The TTS market is still moving.
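The "database driver" idea can be sketched as a small interface plus a registry keyed by a config value. Everything named here (`TTSProvider`, `TTSConfig`, `SilenceProvider`, `make_provider`) is hypothetical; a real adapter would wrap a vendor SDK behind the same `stream` method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Protocol


class TTSProvider(Protocol):
    """The only surface the rest of the application depends on."""

    def stream(self, text: str) -> Iterator[bytes]:
        ...


@dataclass(frozen=True)
class TTSConfig:
    provider: str  # registry key, e.g. read from an environment variable
    model: str     # the model name lives in config, never in call sites
    voice: str


class SilenceProvider:
    """Hypothetical stand-in; a real adapter would call a vendor SDK here."""

    def __init__(self, config: TTSConfig) -> None:
        self.config = config

    def stream(self, text: str) -> Iterator[bytes]:
        for _ in text.split():
            yield b"\x00" * 320  # one placeholder chunk per word


_REGISTRY: Dict[str, Callable[[TTSConfig], TTSProvider]] = {
    "silence": SilenceProvider,
}


def make_provider(config: TTSConfig) -> TTSProvider:
    """Look up the adapter named in config and construct it."""
    try:
        factory = _REGISTRY[config.provider]
    except KeyError:
        raise ValueError(f"unknown TTS provider: {config.provider!r}") from None
    return factory(config)


if __name__ == "__main__":
    cfg = TTSConfig(provider="silence", model="placeholder-model", voice="default")
    audio = b"".join(make_provider(cfg).stream("hello there world"))
    print(len(audio), "bytes")
```

Swapping vendors then means adding one adapter class and changing one config value, with no changes at the call sites.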
