gpt-4o-mini-tts
by OpenAI · Released 2025
OpenAI gpt-4o-mini-tts: steerable TTS with `instructions` for tone and delivery. Six classic voices plus newer additions.
Powered by OpenAI · Neural TTS
| Spec | Value |
|---|---|
| Context Window | N/A |
| Parameters | Not disclosed |
| Max Output | N/A |
| Category | Text to Speech |
Overview
`gpt-4o-mini-tts` is OpenAI's steerable text-to-speech model built on GPT-4o mini. It produces natural speech and exposes an `instructions` field that controls delivery style (see platform.openai.com/docs/models/gpt-4o-mini-tts and the text-to-speech guide). On CallMissed, POST to `/v1/audio/speech` with `"model": "gpt-4o-mini-tts"`, the `input` text, and a `voice` name. Audio is returned as MP3 or another supported format, depending on the request parameters.
OpenAI documents up to 2,000 tokens of input text per request and pricing around $0.60 per million input text tokens and $12.00 per million output audio tokens on the model page. CallMissed lists $0.20 per 10K characters on our rate card for simpler budgeting. Streaming audio output is supported — start playback before the full clip generates.
Voices include alloy, echo, fable, onyx, nova, shimmer, plus newer voices such as ash, ballad, coral, sage, verse, marin, and cedar (see platform.openai.com/docs/guides/text-to-speech). The `instructions` parameter is the differentiator — specify "speak warmly and slowly", "news anchor tone", or "whisper" without SSML. Azure hosts gpt-4o-mini-tts on a regional endpoint; CallMissed routes automatically.
Use cases: voice agents paired with text LLMs, audiobook drafts, localized IVR, accessibility readouts, and product demos requiring emotional range. Compare to Deepgram Aura (`aura-2-en`) for English low-latency telephony at scale, and to Sarvam Bulbul for Indian languages. gpt-4o-mini-tts excels when instruction-following delivery matters more than shaving milliseconds off latency.
Integration tips: chunk long articles below token limits, escape JSON properly in curl examples, and cache common phrases server-side to save credits. Test clipping on mobile browsers when streaming MP3. Pair with `gpt-4.1` or `gpt-4o` for dialog logic while TTS handles rendering.
Limitations: English-centric voice catalog, 2K token input cap, not the cheapest TTS (`melotts` and Aura are lower cost), and instructions can be over-interpreted — keep prompts concrete. Realtime speech-to-speech (`gpt-realtime`) bypasses separate TTS when you need duplex conversation.
Instruction crafting: the `instructions` field accepts prosody guidance — "Speak like a calm airline pilot" beats vague "sound nice". Keep instructions under a few sentences to avoid conflicting with input text.
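As a rough illustration of passing a short, concrete instruction, the sketch below builds (but does not send) a request to the speech endpoint. The URL and key placeholder come from this page's curl example; the helper name `build_speech_request` is hypothetical, and it assumes CallMissed forwards the `instructions` field unchanged.

```python
import json
import urllib.request

# Assumption: base URL and key placeholder are taken from this page's curl example.
API_URL = "https://api.callmissed.com/v1/audio/speech"


def build_speech_request(text, voice="alloy", instructions=None, api_key="cm_YOUR_KEY"):
    """Return a urllib Request for the speech endpoint (nothing is sent here)."""
    payload = {"model": "gpt-4o-mini-tts", "input": text, "voice": voice}
    if instructions:
        # Keep this short and concrete, ideally one or two sentences.
        payload["instructions"] = instructions
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request(
    "Welcome aboard. We expect a smooth flight today.",
    instructions="Speak like a calm airline pilot.",
)
# To call the API for real: urllib.request.urlopen(req).read() returns the audio bytes.
```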
Voice branding: pick one voice per product line and stick to it — users recognize sonic brand identity. Test cedar/marin vs classic alloy for your demographic.
Chunking long content: novels exceed 2K token limits — split by paragraph, generate clips, concatenate with ffmpeg in post. For real-time agents, generate sentence-by-sentence to minimize latency.
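One way to approach the paragraph split is sketched below. It uses a character budget as a stand-in for the token limit, which is an assumption: the 4,000-character default is a deliberately conservative guess, not a documented figure, and `chunk_paragraphs` is a hypothetical helper.

```python
import re


def chunk_paragraphs(text, max_chars=4000):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is split on sentence boundaries first.
    Units are rejoined with a blank line. A single sentence longer than
    max_chars is kept whole, so the caller must handle that case.
    """
    units = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            units.append(para)
        else:
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting clips concatenated in order. For real-time agents, the same sentence-boundary regex can be used on its own to emit one sentence per request.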
Streaming playback: begin audio playback on first streamed bytes; buffer 200–500 ms to avoid underruns on mobile.
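The buffering idea can be sketched as a small generator that holds back the first streamed chunks until enough audio has arrived. The 128 kbps bitrate is an assumption used only to convert milliseconds to bytes; both function names are hypothetical.

```python
def prebuffer_bytes(buffer_ms, bitrate_kbps=128):
    """Bytes of compressed audio that hold roughly buffer_ms of playback."""
    return int(bitrate_kbps * 1000 / 8 * buffer_ms / 1000)


def prebuffered(chunks, min_bytes):
    """Hold back streamed chunks until min_bytes have arrived, then pass through."""
    pending, buffered, started = [], 0, False
    for chunk in chunks:
        if started:
            yield chunk
            continue
        pending.append(chunk)
        buffered += len(chunk)
        if buffered >= min_bytes:
            started = True
            yield b"".join(pending)
            pending = []
    if pending:
        # The stream ended before the threshold was reached; flush what we have.
        yield b"".join(pending)
```

Wrapping the response iterator as `prebuffered(stream, prebuffer_bytes(300))` would delay playback until about 300 ms of audio is available, then forward later chunks as they arrive.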
Multilingual limitation: for Hindi or Tamil voice, use Sarvam Bulbul. The OpenAI voice list for gpt-4o-mini-tts is English-centric.
Cost comparison (mental model, cheapest to most expensive): melotts, then Aura, then gpt-4o-mini-tts (premium for steerability), then realtime (highest, for duplex).
Caching spoken prompts: IVR trees repeat the same phrases — cache MP3s at CDN instead of regenerating TTS every call.
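A minimal sketch of the caching idea follows, assuming the cache key should change whenever the text, voice, instructions, model, or format changes. The in-memory dictionary stands in for a CDN or object store, and `tts_cache_key` and `PhraseCache` are hypothetical names.

```python
import hashlib
import json


def tts_cache_key(text, voice, instructions="", model="gpt-4o-mini-tts", fmt="mp3"):
    """Stable key: changing text, voice, instructions, model or format yields a new key."""
    blob = json.dumps([model, voice, instructions, fmt, text], ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest() + "." + fmt


class PhraseCache:
    """In-memory stand-in for a CDN or object store keyed by tts_cache_key."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable(text, voice, instructions) -> bytes
        self._store = {}
        self.misses = 0

    def get(self, text, voice, instructions=""):
        key = tts_cache_key(text, voice, instructions)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice, instructions)
        return self._store[key]
```

Including the voice and instructions in the key matters: the same IVR phrase rendered with a different delivery is a different audio file.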
Accessibility: provide text transcripts alongside TTS for hearing-impaired users even in voice-first flows.
OpenAI TTS guide alignment: voices are documented at platform.openai.com/docs/guides/text-to-speech. Verify voice names against our allowlist before shipping UI pickers. The `instructions` parameter is documented there as an SSML-free way to steer delivery; A/B test instruction variants on projects such as hold-music replacement.
IVR replacement ROI: traditional studio voiceover costs thousands per language; TTS prompts cost cents — regenerate copy instantly when compliance updates phrasing.
Audio post-processing: apply loudness normalization (-16 LUFS podcast standard) across clips for consistent playlists. Noise gate optional for breath sounds.
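One possible way to apply the loudness target is ffmpeg's `loudnorm` filter. The sketch below only builds the command line. The true-peak and loudness-range values are common choices assumed here, not figures from this page, and `loudnorm_command` is a hypothetical helper.

```python
import shlex


def loudnorm_command(src, dst, target_lufs=-16.0, true_peak_db=-1.5, loudness_range=11.0):
    """Build an ffmpeg argv list that applies single-pass loudnorm to one clip."""
    audio_filter = f"loudnorm=I={target_lufs}:TP={true_peak_db}:LRA={loudness_range}"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


cmd = loudnorm_command("clip_001.mp3", "clip_001_norm.mp3")
print(shlex.join(cmd))
# Run with subprocess.run(cmd, check=True) on a machine that has ffmpeg installed.
```

Running the same command over every clip before concatenation keeps perceived volume consistent across a playlist.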
Developer pitfalls: unescaped quotes in JSON bodies break curl examples — use jq or SDKs. Very long `input` strings truncate — chunk and stitch.
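To illustrate the escaping pitfall, the short sketch below lets a JSON serializer build the request body instead of hand-writing it. The sample text is made up for the example.

```python
import json

text = 'She said "press 1" and hung up.\nLine two: 50% off, it\'s today only.'
body = json.dumps({"model": "gpt-4o-mini-tts", "input": text, "voice": "alloy"})

# json.dumps escapes embedded double quotes and newlines, so the body parses back intact.
print(body)
```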
Voice agent pairing: text LLM (`gpt-4.1`) decides what to say; mini TTS renders — separates logic from speech prosody. Realtime models merge both when latency budget allows one vendor stack.
Enterprise voice brand: document chosen voice in brand guidelines alongside color palette — consistency matters in omnichannel support (phone, app, kiosk).
Monitoring: track TTS failure rate, average characters per request, cost per thousand characters, audio duration distribution.
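A small accumulator for these metrics might look like the sketch below. It reports estimated spend from the $0.20 per 10K characters figure quoted on this page and assumes every request is billed at that flat rate. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, median


@dataclass
class TtsMetrics:
    """Accumulates per-request TTS stats for export to a metrics backend."""
    usd_per_10k_chars: float = 0.20  # rate-card figure quoted on this page
    chars: list = field(default_factory=list)
    durations_s: list = field(default_factory=list)
    failures: int = 0

    def record(self, n_chars, ok, duration_s=None):
        """Log one request; duration_s is the length of the returned audio."""
        self.chars.append(n_chars)
        if not ok:
            self.failures += 1
        elif duration_s is not None:
            self.durations_s.append(duration_s)

    def summary(self):
        total = len(self.chars)
        total_chars = sum(self.chars)
        return {
            "requests": total,
            "failure_rate": self.failures / total if total else 0.0,
            "avg_chars": mean(self.chars) if total else 0.0,
            # Assumes every request, failed or not, is billed at the flat rate.
            "est_cost_usd": total_chars / 10_000 * self.usd_per_10k_chars,
            "median_duration_s": median(self.durations_s) if self.durations_s else 0.0,
        }
```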
Future voices: OpenAI adds voices periodically — feature-flag new voices in beta before global rollout.
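A feature flag for new voices can be as simple as the sketch below. The voice names are the ones listed on this page, but which of them count as beta is a hypothetical split for illustration, as is the `resolve_voice` helper.

```python
# Voice names are the ones listed on this page; the stable/beta split is hypothetical.
STABLE_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
BETA_VOICES = {"ash", "ballad", "coral", "sage", "verse", "marin", "cedar"}


def resolve_voice(requested, beta_enabled=False, default="alloy"):
    """Return the requested voice if this caller may use it, else the default."""
    allowed = (STABLE_VOICES | BETA_VOICES) if beta_enabled else STABLE_VOICES
    return requested if requested in allowed else default
```

Falling back to a default rather than raising keeps older clients working if a voice is later withdrawn from the allowlist.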
Rate limits: bursts of TTS requests during notifications (for example, 1,000 users at once) need queueing on your side. CallMissed keys share tenant rate limits, so smooth spikes with SQS- or Kafka-backed workers. Load-test notification bursts well before peak events such as Black Friday; TTS queues are a common hidden bottleneck in e-commerce apps. Document the default voice and instructions in your API reference so third-party integrators can reproduce your sonic brand without guesswork.
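Client-side smoothing can be sketched in-process with a semaphore that caps the number of requests in flight. This is a simplified stand-in for a queue-backed worker pool, not a replacement for one. The concurrency limit of 8 is an arbitrary assumption, and `synthesize` is a hypothetical async callable that performs one TTS request.

```python
import asyncio


async def run_bounded(jobs, synthesize, max_concurrency=8):
    """Run synthesize(job) for every job with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(job):
        async with semaphore:
            return await synthesize(job)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(worker(job) for job in jobs))
```

A burst of 1,000 notifications would then be submitted as `asyncio.run(run_bounded(jobs, synthesize))`, with the limit tuned to stay under the tenant rate limit.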
Pricing
| Metric | Price |
|---|---|
| Price /10K chars | ₹20.0000 |
1 credit = ₹1 = $0.01 USD. Prices are based on provider rates; CallMissed passes them through with a markup of about 35%.
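For budgeting, the rate above reduces to simple arithmetic. The sketch below assumes billing is linear in character count with no minimum charge or rounding, which this page does not confirm; `estimate_cost` is a hypothetical helper.

```python
CREDITS_PER_10K_CHARS = 20.0  # from the pricing table on this page
USD_PER_CREDIT = 0.01         # 1 credit = ₹1 = $0.01, as stated above


def estimate_cost(n_chars):
    """Return (credits, usd) for n_chars of input, assuming linear per-character billing."""
    credits = n_chars / 10_000 * CREDITS_PER_10K_CHARS
    return credits, credits * USD_PER_CREDIT
```

Under these assumptions a 25,000-character article costs about 50 credits, or roughly $0.50.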
Key Highlights
- Instructions field for tone
- 6 classic OpenAI voices, plus newer additions
Technical Details
- Model id: gpt-4o-mini-tts
- POST /v1/audio/speech
Strengths
- Steerable delivery
- Natural quality
Limitations
- English-focused voices
Use Cases
- Voice agents paired with text LLMs
- Audiobook drafts
- Localized IVR
- Accessibility readouts
- Product demos requiring emotional range
API Example
curl https://api.callmissed.com/v1/audio/speech \
  -H "Authorization: Bearer cm_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini-tts", "input": "Hello world", "voice": "alloy"}' \
  --output speech.mp3
Endpoint: POST /v1/audio/speech · Model ID: gpt-4o-mini-tts