Text to Speech · steerable

gpt-4o-mini-tts

by OpenAI · Released 2025

OpenAI gpt-4o-mini-tts: steerable TTS with an `instructions` field for tone and delivery. Six classic voices plus newer additions.

Powered by OpenAI · Neural TTS

Context Window: N/A

Parameters: Not disclosed

Max Output: N/A

Category: Text to Speech

Overview

`gpt-4o-mini-tts` is OpenAI's steerable text-to-speech model built on GPT-4o mini. It produces natural speech and accepts an `instructions` field that controls delivery style (see platform.openai.com/docs/models/gpt-4o-mini-tts and the text-to-speech guide). On CallMissed, send a POST to `/v1/audio/speech` with `"model": "gpt-4o-mini-tts"`, the `input` text, and a `voice` name. Audio is returned as MP3 or another supported format, depending on the request parameters.
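As a minimal sketch of the request described above, the Python below builds the POST and, separately, sends it. The base URL and the `cm_` key prefix are taken from the API example on this page. The `response_format` field is an assumption borrowed from OpenAI's speech endpoint and has not been confirmed for CallMissed; the helper names are hypothetical.

```python
import json
import urllib.request

# Base URL taken from the curl example on this page.
API_URL = "https://api.callmissed.com/v1/audio/speech"


def build_speech_request(api_key: str, text: str, voice: str = "alloy",
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build (but do not send) a TTS request.

    `response_format` mirrors OpenAI's parameter name; treat it as an assumption.
    """
    payload = {
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, text: str, out_path: str = "speech.mp3") -> None:
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(api_key, text)) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())
```

Keeping request construction separate from the network call makes the payload easy to inspect in tests without spending credits.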

OpenAI's model page documents a limit of 2,000 input tokens per request, with pricing around $0.60 per million input text tokens and $12.00 per million output audio tokens. CallMissed lists $0.20 per 10K characters on our rate card for simpler budgeting. Streaming audio output is supported, so playback can start before the full clip has been generated.
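For budgeting, the per-character rate can be turned into a quick estimate. This sketch assumes billing is a straight proration of the $0.20 per 10K characters rate-card figure, with no minimum charge or rounding rule; the actual billing granularity is not stated on this page.

```python
from decimal import Decimal

# CallMissed rate-card figure quoted on this page.
RATE_USD_PER_10K_CHARS = Decimal("0.20")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate cost as a linear per-character proration.

    Assumption: no minimum charge and no per-request rounding.
    """
    cost = Decimal(len(text)) * RATE_USD_PER_10K_CHARS / Decimal(10_000)
    return cost.quantize(Decimal("0.0001"))
```

Under that assumption, a 2,500-character article comes to about $0.05.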

Voices include alloy, echo, fable, onyx, nova, shimmer, plus newer voices such as ash, ballad, coral, sage, verse, marin, and cedar (see platform.openai.com/docs/guides/text-to-speech). The `instructions` parameter is the differentiator — specify "speak warmly and slowly", "news anchor tone", or "whisper" without SSML. Azure hosts gpt-4o-mini-tts on a regional endpoint; CallMissed routes automatically.

Use cases: voice agents paired with text LLMs, audiobook drafts, localized IVR, accessibility readouts, and product demos requiring emotional range. Compare with Deepgram Aura (`aura-2-en`) for low-latency English telephony at scale, and with Sarvam Bulbul for Indian languages. gpt-4o-mini-tts excels when instruction-following delivery matters more than shaving milliseconds off latency.

Integration tips: chunk long articles below token limits, escape JSON properly in curl examples, and cache common phrases server-side to save credits. Test clipping on mobile browsers when streaming MP3. Pair with `gpt-4.1` or `gpt-4o` for dialog logic while TTS handles rendering.

Limitations: English-centric voice catalog, 2K token input cap, not the cheapest TTS (`melotts` and Aura are lower cost), and instructions can be over-interpreted — keep prompts concrete. Realtime speech-to-speech (`gpt-realtime`) bypasses separate TTS when you need duplex conversation.

Instruction crafting: the `instructions` field accepts prosody guidance — "Speak like a calm airline pilot" beats vague "sound nice". Keep instructions under a few sentences to avoid conflicting with input text.
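A small sketch of attaching `instructions` to the request body while enforcing the "few sentences" guideline. The cutoff of three sentences and the helper name are assumptions chosen for illustration; the sentence count is a rough punctuation split, not a linguistic parser.

```python
import re
from typing import Optional

MAX_INSTRUCTION_SENTENCES = 3  # assumed cutoff for "a few sentences"


def speech_payload(text: str, voice: str, instructions: Optional[str] = None) -> dict:
    """Return a request body, adding `instructions` only when it is short."""
    payload = {"model": "gpt-4o-mini-tts", "input": text, "voice": voice}
    if instructions:
        # Rough sentence count: split on runs of terminal punctuation.
        sentences = [s for s in re.split(r"[.!?]+", instructions) if s.strip()]
        if len(sentences) > MAX_INSTRUCTION_SENTENCES:
            raise ValueError("instructions too long; keep delivery guidance short")
        payload["instructions"] = instructions.strip()
    return payload
```

For example, `speech_payload("Welcome aboard.", "alloy", "Speak like a calm airline pilot.")` carries one concrete delivery cue and leaves the spoken text untouched.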

Voice branding: pick one voice per product line and stick to it — users recognize sonic brand identity. Test cedar/marin vs classic alloy for your demographic.

Chunking long content: novels exceed 2K token limits — split by paragraph, generate clips, concatenate with ffmpeg in post. For real-time agents, generate sentence-by-sentence to minimize latency.
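The paragraph-level split can be sketched as a greedy packer. Token counts are approximated at four characters per token, which is a rough heuristic for English and not the model's tokenizer, and a 20% safety margin is applied; both numbers are assumptions.

```python
import re

MAX_INPUT_TOKENS = 2000  # limit quoted on this page
CHARS_PER_TOKEN = 4      # rough English heuristic, not an exact tokenizer


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (for sentence-by-sentence agents)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_text(text, max_tokens=MAX_INPUT_TOKENS, safety=0.8):
    """Greedily pack paragraphs into chunks under an approximate character budget.

    Paragraphs larger than the budget are broken into sentences first.
    A single sentence longer than the budget is passed through unchanged.
    """
    budget = int(max_tokens * CHARS_PER_TOKEN * safety)
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        units.extend([para] if len(para) <= budget else split_sentences(para))

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > budget:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk becomes one request; the resulting clips can then be joined in order. For real-time agents, `split_sentences` alone gives the sentence-sized units.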

Streaming playback: begin audio playback on first streamed bytes; buffer 200–500 ms to avoid underruns on mobile.
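One way to implement the buffer is to hold back streamed chunks until enough bytes have arrived. The sketch assumes a constant-bitrate 128 kbps MP3 stream to convert milliseconds into bytes; the real bitrate depends on the output format, and the function names are hypothetical.

```python
from typing import Iterable, Iterator


def prebuffer_bytes(buffer_ms: int, bitrate_kbps: int = 128) -> int:
    """Bytes needed to hold buffer_ms of audio at a constant bitrate (assumed)."""
    return bitrate_kbps * 1000 // 8 * buffer_ms // 1000


def prebuffered(chunks: Iterable[bytes], min_bytes: int) -> Iterator[bytes]:
    """Hold back output until min_bytes have arrived, then pass chunks through."""
    pending = bytearray()
    started = False
    for chunk in chunks:
        if started:
            yield chunk
            continue
        pending.extend(chunk)
        if len(pending) >= min_bytes:
            started = True
            yield bytes(pending)
    if not started and pending:
        # Short clip: the stream ended before the buffer filled, so flush it.
        yield bytes(pending)
```

At the assumed bitrate, a 300 ms buffer is `prebuffer_bytes(300)`, which is 4,800 bytes.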

Multilingual limitation: for Hindi or Tamil voices, use Sarvam Bulbul; the OpenAI voice list for gpt-4o-mini-tts is English-centric.

Cost ranking (mental model, cheapest to most expensive): melotts, then Aura, then gpt-4o-mini-tts as the premium option for steerability, then realtime models for duplex conversation.

Caching spoken prompts: IVR trees repeat the same phrases — cache MP3s at CDN instead of regenerating TTS every call.
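A cache only works if identical prompts map to the same object name. This sketch derives a stable key from everything that affects the audio; the `tts/<voice>/<hash>.<ext>` layout is a hypothetical naming scheme, not a CallMissed convention.

```python
import hashlib
import json


def tts_cache_key(text, voice, instructions="", model="gpt-4o-mini-tts", fmt="mp3"):
    """Return a deterministic object path for a rendered prompt."""
    canonical = json.dumps(
        {
            "model": model,
            "voice": voice,
            "instructions": instructions,
            # Collapse whitespace so trivially different copies share one clip.
            "input": " ".join(text.split()),
            "format": fmt,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"tts/{voice}/{digest}.{fmt}"
```

Look the key up at the CDN or object store first and call the TTS endpoint only on a miss. Because voice and instructions are part of the hash, changing either produces a new clip.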

Accessibility: provide text transcripts alongside TTS for hearing-impaired users even in voice-first flows.

OpenAI TTS guide alignment: voices are documented at platform.openai.com/docs/guides/text-to-speech. Verify voice names against our allowlist before shipping UI pickers. The `instructions` parameter is documented in the same guide as the SSML-free way to steer delivery; A/B test instruction variants on projects such as replacing hold music with spoken prompts.

IVR replacement ROI: traditional studio voiceover costs thousands per language; TTS prompts cost cents — regenerate copy instantly when compliance updates phrasing.

Audio post-processing: apply loudness normalization (-16 LUFS podcast standard) across clips for consistent playlists. Noise gate optional for breath sounds.
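Loudness normalization can be scripted with ffmpeg's `loudnorm` filter. The sketch below only builds the command; the true-peak and loudness-range values are assumed defaults for illustration, and only the -16 LUFS target comes from the text above.

```python
def loudnorm_command(src, dst, target_lufs=-16.0, true_peak_db=-1.5, lra=11.0):
    """Build an ffmpeg command that normalizes src to the target loudness.

    Only target_lufs comes from this page; true_peak_db and lra are assumptions.
    """
    audio_filter = f"loudnorm=I={target_lufs}:TP={true_peak_db}:LRA={lra}"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]
```

Run it with `subprocess.run(loudnorm_command("clip.mp3", "clip_norm.mp3"), check=True)` on a machine with ffmpeg installed. A single pass like this is approximate; a two-pass measurement is generally more accurate if exact levels matter.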

Developer pitfalls: unescaped quotes in JSON bodies break curl examples — use jq or SDKs. Very long `input` strings truncate — chunk and stitch.
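To illustrate the quoting pitfall, the sketch below serializes the body with `json.dumps` and then shell-quotes it, so quotes, backslashes, and newlines in the input survive both layers. The helper name is hypothetical.

```python
import json
import shlex


def curl_data_arg(text, voice="alloy"):
    """Return a shell-safe JSON body for use after `curl -d`."""
    body = json.dumps(
        {"model": "gpt-4o-mini-tts", "input": text, "voice": voice},
        ensure_ascii=False,
    )
    # shlex.quote wraps the JSON so the shell passes it through unchanged.
    return shlex.quote(body)
```

When calling from Python directly, the `json.dumps` step alone is enough; the shell quoting only matters when generating curl commands.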

Voice agent pairing: text LLM (`gpt-4.1`) decides what to say; mini TTS renders — separates logic from speech prosody. Realtime models merge both when latency budget allows one vendor stack.

Enterprise voice brand: document chosen voice in brand guidelines alongside color palette — consistency matters in omnichannel support (phone, app, kiosk).

Monitoring: track TTS failure rate, average characters per request, cost per thousand characters, audio duration distribution.
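These metrics can be collected with a small in-process aggregator before wiring up a real metrics backend. The class and field names are hypothetical, and the default rate is the $0.20 per 10K characters figure from this page.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class TTSMetrics:
    rate_per_10k_chars: float = 0.20  # rate-card figure from this page
    requests: int = 0
    failures: int = 0
    total_chars: int = 0
    durations_s: list = field(default_factory=list)

    def record(self, n_chars: int, ok: bool, duration_s: float = 0.0) -> None:
        """Record one request; audio duration is kept only for successes."""
        self.requests += 1
        self.total_chars += n_chars
        if ok:
            self.durations_s.append(duration_s)
        else:
            self.failures += 1

    def summary(self) -> dict:
        n = max(self.requests, 1)  # avoid division by zero before any traffic
        return {
            "failure_rate": self.failures / n,
            "avg_chars_per_request": self.total_chars / n,
            "cost_per_1k_chars": self.rate_per_10k_chars / 10,
            "median_duration_s": median(self.durations_s) if self.durations_s else None,
            "max_duration_s": max(self.durations_s, default=None),
        }
```

Median and maximum are a coarse stand-in for a full duration distribution; a histogram is the natural next step once volumes grow.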

Future voices: OpenAI adds voices periodically — feature-flag new voices in beta before global rollout.
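A feature flag for new voices can be as simple as a set intersection. The split between stable and beta voices below is purely illustrative, and `allowlist` stands in for whatever list of permitted voices your account exposes.

```python
# Illustrative split only: which voices count as "beta" is your product decision.
STABLE_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
BETA_VOICES = {"marin", "cedar"}


def voices_for_user(in_beta: bool, allowlist: set) -> list:
    """Return the voices a UI picker should offer to this user."""
    offered = STABLE_VOICES | (BETA_VOICES if in_beta else set())
    # Never offer a voice the platform allowlist does not include.
    return sorted(offered & set(allowlist))
```

Promoting a voice to general availability then means moving one name between the two sets.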

Rate limits: bursts such as 1,000 simultaneous notification requests need queueing on your side. CallMissed keys share tenant rate limits, so smooth spikes with SQS- or Kafka-backed workers. Load-test peak-season notification bursts (Halloween, Black Friday) ahead of time; TTS queues are a common hidden bottleneck in e-commerce apps. Document your default voice and instructions in your API reference so third-party integrators can reproduce your sonic brand without guesswork.
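As an in-process stand-in for a queue-backed worker pool, the sketch below caps the number of TTS calls in flight with an `asyncio.Semaphore`. The concurrency limit of 8 is an arbitrary assumption, since tenant limits are not published on this page, and `worker` is whatever coroutine performs the actual request.

```python
import asyncio


async def run_bounded(jobs, worker, max_concurrency=8):
    """Run worker(job) for every job with at most max_concurrency in flight.

    Results are returned in the same order as jobs.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with sem:
            return await worker(job)

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A durable queue such as SQS or Kafka adds retries and survives restarts, which this sketch does not; the semaphore only shows the smoothing idea.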

Pricing

Metric | Price
Price /10K chars | ₹20.0000

1 credit = ₹1 = $0.01 USD. Prices shown from provider; CallMissed passes through with ~35% markup.

Key Highlights

  • Instructions field for tone
  • 6 classic OpenAI voices plus newer additions

Technical Details

  • Model id: gpt-4o-mini-tts
  • POST /v1/audio/speech

Strengths

  • Steerable delivery
  • Natural quality

Limitations

  • English-focused voices

Use Cases

Voice agents · Narration · IVR

API Example

curl https://api.callmissed.com/v1/audio/speech \
  -H "Authorization: Bearer cm_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini-tts", "input": "Hello world", "voice": "alloy"}' \
  --output speech.mp3

Endpoint: POST /v1/audio/speech · Model ID: gpt-4o-mini-tts
