gpt-realtime-mini
by OpenAI · Released 2025
OpenAI gpt-realtime-mini — cheaper speech-to-speech realtime model for voice agents.
Powered by OpenAI · Realtime multimodal
Context Window
32K
Parameters
Not disclosed
Max Output
N/A
Category
LLM Chat
Overview
`gpt-realtime-mini` is OpenAI's cost-efficient realtime speech-to-speech model — the same unified audio-in/audio-out product pattern as `gpt-realtime`, tuned for lower cost (platform.openai.com/docs/models/gpt-realtime-mini). On CallMissed, select it as `llm_model=gpt-realtime-mini` when creating a voice session. Like its larger sibling, it does not appear on `/v1/chat/completions`; it is voice-pipeline only.
The model accepts text, image, and audio inputs and produces text and audio outputs. The model card lists a 32,000-token context window and up to 4,096 output tokens. OpenAI's model page lists substantially lower text token rates than full `gpt-realtime`: $0.60 input and $2.40 output per million text tokens (verify audio pricing separately). That makes `gpt-realtime-mini` attractive for consumer-facing assistants, internal help lines, and high-volume pilots where full realtime quality is unnecessary.
Choose gpt-realtime-mini when latency must stay conversational but absolute frontier reasoning is overkill — appointment booking, FAQ bots, guided troubleshooting with short turns, and language practice. You still benefit from one model handling end-of-turn detection server-side (`turn_detection="realtime_llm"` in our agent) rather than tuning VAD + STT + LLM + TTS yourself.
Voice selection uses the realtime voice allowlist (alloy, echo, shimmer, ash, ballad, coral, sage, verse, marin, cedar). Test on real devices: mobile networks add jitter that text-only APIs never expose. Monitor credits using voice session metrics — audio token accounting differs from chat.
Platform status may show maintenance when Azure realtime deployments lack quota; consult `/api/v1/models` before shipping. Fallback pattern: use `saaras:v3` + `gpt-4.1` + `bulbul:v3` pipeline models for Indian-language or batch-friendly voice when realtime is down.
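The check-then-fall-back pattern above might be sketched as follows. This is a minimal illustration, not the CallMissed client: the catalog shape (a list of entries with `id` and `status` fields), the `"available"` status value, and the `stt_model` / `tts_model` key names are all assumptions, so check the real `/api/v1/models` response before relying on it.

```python
from typing import Iterable, Mapping

# Pipeline fallback named in the text: STT + LLM + TTS.
# The key names "stt_model" and "tts_model" are hypothetical.
FALLBACK_PIPELINE = {
    "stt_model": "saaras:v3",
    "llm_model": "gpt-4.1",
    "tts_model": "bulbul:v3",
}


def choose_voice_stack(catalog: Iterable[Mapping[str, str]],
                       preferred: str = "gpt-realtime-mini") -> dict:
    """Return a realtime config if `preferred` is available, else the pipeline fallback.

    Assumes each catalog entry looks like {"id": ..., "status": ...};
    both field names and the "available" value are assumptions.
    """
    for entry in catalog:
        if entry.get("id") == preferred and entry.get("status") == "available":
            return {"mode": "realtime", "llm_model": preferred}
    # Model missing from the catalog or flagged (e.g. maintenance): fall back.
    return {"mode": "pipeline", **FALLBACK_PIPELINE}
```

Treating a missing entry the same as a maintenance flag keeps the decision conservative: the realtime path is chosen only when the catalog positively reports it as available.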
Limitations: not a text API model, preview/GA lifecycle subject to OpenAI/Azure changes, and quality gap versus full gpt-realtime on complex reasoning mid-call. For transcription-only or TTS-only tasks, dedicated `gpt-4o-transcribe` / `gpt-4o-mini-tts` endpoints are simpler and cheaper.
Pilot economics: realtime-mini is designed so startups can run voice pilots without flagship realtime bills. To budget a pilot, model a 5-minute average call, multiply by expected monthly calls, include the audio token markup, and compare the result against a text chat fallback.
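The budgeting arithmetic can be written down as a small helper. This is a sketch under stated assumptions: `cost_per_minute` is a blended per-minute audio cost that you must derive yourself from current published rates (no value is implied here), and the default `markup` of 0.35 assumes the ~35% figure quoted in the Pricing section applies on top of that rate.

```python
def monthly_pilot_cost(calls_per_month: int,
                       cost_per_minute: float,
                       avg_call_minutes: float = 5.0,
                       markup: float = 0.35) -> float:
    """Estimate monthly voice spend, in the same currency as `cost_per_minute`.

    `cost_per_minute` is a caller-supplied assumption (blended audio in/out
    cost for one minute of conversation); it is not a published rate.
    """
    if calls_per_month < 0 or cost_per_minute < 0 or avg_call_minutes < 0:
        raise ValueError("inputs must be non-negative")
    provider_cost = calls_per_month * avg_call_minutes * cost_per_minute
    return provider_cost * (1.0 + markup)
```

Running the same function with the per-minute cost of a text chat fallback (or of full `gpt-realtime`) gives the comparison the paragraph above asks for.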
Use case fit: ideal for structured dialogs (collect name, address, confirm appointment); less suited for open-ended therapy or legal advice, where full realtime quality matters.
Client implementation: WebRTC through LiveKit requires stable NAT traversal — test on corporate VPNs and mobile LTE. Provide reconnect UX; realtime sessions drop on network blips.
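One common way to pace reconnect attempts after a network blip is exponential backoff with full jitter. The sketch below only computes the delay schedule; the function name and default values are illustrative choices, not LiveKit or CallMissed settings, and it is independent of any reconnect logic your WebRTC client library already provides.

```python
import random
from typing import List, Optional


def reconnect_delays(max_attempts: int = 5,
                     base: float = 0.5,
                     cap: float = 8.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter backoff: attempt n waits uniformly in [0, min(cap, base * 2**n)] seconds."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(max_attempts)]
```

The jitter spreads out retries so that many clients dropped by the same outage do not all reconnect at the same instant, and the cap keeps the worst-case wait short enough to show a "reconnecting" state rather than a dead call.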
Quality ladder: start on realtime-mini; if CSAT or task completion falls below threshold, A/B test full `gpt-realtime` on a fraction of traffic.
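Routing "a fraction of traffic" to the larger model is usually done with deterministic bucketing, so a given session or user always lands in the same arm. A minimal sketch, assuming you have some stable identifier to hash (the `session_id` parameter and the 10% default are illustrative):

```python
import hashlib


def assign_model(session_id: str, treatment_fraction: float = 0.1) -> str:
    """Deterministically map an identifier to a model for an A/B test."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # First 4 bytes as an integer, scaled to [0, 1); exact in floating point.
    bucket = int.from_bytes(digest[:4], "big") / 2 ** 32
    return "gpt-realtime" if bucket < treatment_fraction else "gpt-realtime-mini"
```

Hashing a user ID instead of a session ID keeps returning users in one arm, which matters if you compare CSAT across calls.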
Pairing with text backends: some teams use realtime-mini for voice UX but POST transcripts to `gpt-4.1` async for CRM notes — decouples latency-sensitive path from long analysis.
Quota awareness: realtime-mini runs on the same Azure realtime infrastructure as the flagship, so maintenance flags apply to both. Have a runbook.
Accessibility: voice-first interfaces help visually impaired users — ensure keyboard alternatives remain available for compliance (WCAG).
Go-to-market pattern: launch a "voice beta" tier on realtime-mini with usage caps, gather CSAT, upgrade power users to full realtime. Marketing sites can embed a demo widget — protect with CAPTCHA and per-IP rate limits because voice is abuse-prone.
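A per-IP limit for a public demo widget could look like the sliding-window sketch below. It is an in-memory illustration only: the class name, the limit of 3 sessions per hour, and the single-process storage are assumptions, and a production deployment would typically keep this state in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class PerIpLimiter:
    """Allow at most `max_sessions` session starts per IP within `window_seconds`."""

    def __init__(self, max_sessions: int = 3, window_seconds: float = 3600.0):
        self.max_sessions = max_sessions
        self.window_seconds = window_seconds
        self._starts = defaultdict(deque)  # ip -> timestamps of recent starts

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        starts = self._starts[ip]
        # Drop starts that have aged out of the window.
        while starts and now - starts[0] >= self.window_seconds:
            starts.popleft()
        if len(starts) >= self.max_sessions:
            return False
        starts.append(now)
        return True
```

Call `allow(ip)` before minting a session token; a `False` result is the point at which to show a CAPTCHA or a "try again later" message.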
Engineering stack: Next.js front-end with LiveKit React components, session tokens minted server-side only, never expose master API keys in browsers. Store minimal PII from voice sessions; transcripts may be optional opt-in.
Competitive positioning vs chained STT+LLM+TTS: realtime-mini reduces glue code and aligns end-of-turn detection with the speech model, often lowering perceived latency even if raw ASR+LLM+TTS benchmarks look similar on paper.
Failure modes: Azure maintenance flag — display banner in app; offer text chat fallback. Audio device permission denied — guide users with UI copy. Echo loops — enforce headphones in docs for desktop support tools.
Analytics: funnel from "start voice session" → "completed task" → "user returned within 7 days". Voice UX without analytics is flying blind.
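The three-stage funnel reduces to two step rates and one overall rate. A small sketch, assuming you already have the three counts from your analytics store (the function and key names are illustrative):

```python
def funnel_rates(started: int, completed: int, returned_7d: int) -> dict:
    """Step-wise and overall conversion for start -> completed -> returned within 7 days."""
    if not (0 <= returned_7d <= completed <= started):
        raise ValueError("expected 0 <= returned_7d <= completed <= started")
    completion = completed / started if started else 0.0
    retention = returned_7d / completed if completed else 0.0
    overall = returned_7d / started if started else 0.0
    return {"completion": completion, "retention": retention, "overall": overall}
```

For example, `funnel_rates(200, 50, 10)` reports a 25% completion rate, 20% retention among completers, and 5% overall.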
Contractual SLA: unless CallMissed explicitly guarantees realtime availability in your enterprise agreement, do not promise 99.99% voice uptime to your customers — reference catalog maintenance status and maintain fallback modalities.
Token estimation worksheet: assume 150 words per minute spoken, convert to audio tokens using OpenAI's published rates, and multiply by peak simultaneous sessions. Finance teams use this to set prepaid credit bundles for voice SKUs.
Runbook: document the chosen fallback models in the same runbook so on-call engineers can switch tenants during outages without product redeploys.
Developer onboarding: clone the CallMissed voice agent sample with `llm_model=gpt-realtime-mini` first. It is the lowest-cost way to validate WebRTC plumbing before promoting staging traffic to full `gpt-realtime`.
Pricing
| Metric | Price (per 1M tokens) |
|---|---|
| Input | ₹1,500 |
| Output | ₹3,000 |
1 credit = ₹1 = $0.01 USD. Prices shown are the provider's; CallMissed passes them through with a ~35% markup.
Key Highlights
- Lower cost realtime
- Speech-to-speech
Technical Details
- Model id: gpt-realtime-mini
Strengths
- Cheaper than gpt-realtime
Limitations
- Voice-only surface
- Maintenance — quota pending
Use Cases
API Example
# Create a voice session with llm_model=gpt-realtime-mini
Endpoint: WebSocket /v1/voice/sessions · Model ID: gpt-realtime-mini
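As a rough illustration of what a session request might carry, the sketch below assembles a config from values mentioned on this page: the model ID, the `turn_detection="realtime_llm"` setting, and the realtime voice allowlist. The `voice` field name and the overall payload shape are assumptions, and the function does not open the WebSocket; consult the CallMissed voice session reference for the actual schema.

```python
# Voice allowlist as listed in the Overview.
REALTIME_VOICES = {
    "alloy", "echo", "shimmer", "ash", "ballad",
    "coral", "sage", "verse", "marin", "cedar",
}


def build_session_config(voice: str = "alloy") -> dict:
    """Build a hypothetical voice-session payload for gpt-realtime-mini."""
    if voice not in REALTIME_VOICES:
        raise ValueError(f"unsupported realtime voice: {voice!r}")
    return {
        "llm_model": "gpt-realtime-mini",
        "turn_detection": "realtime_llm",
        "voice": voice,  # field name is an assumption
    }
```

Validating the voice client-side turns a typo into an immediate error instead of a failed session start.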