The Cost Economics of a Voice Minute in 2026

CallMissed · May 8, 2026

A voice minute is the smallest unit of revenue and cost for any voice AI product. Understanding what it actually costs to deliver one — and where the costs hide — is the difference between a healthy unit economics story and a graveyard of voice agent startups. Here is the 2026 breakdown.

The headline number

End-to-end production voice agent cost in 2026: $0.12–$0.45 per conversation minute, all-in, including telephony, STT, LLM, TTS, and platform overhead. Industry pricing analyses report a broadly similar range across vendors.

The variance comes from three things: which TTS you use, how chatty your agent is, and whether you're routing through a managed platform or running close to the metal.

The component breakdown

Per Klariqo's 2026 cost-per-minute analysis and Softcery's calculator, the line items at typical pricing:

Speech-to-text

Cost: $0.003–$0.02 per minute.

Range: Whisper API at the cheap end, premium streaming STT (Deepgram Nova-3, AssemblyAI) at the higher end.

Typical share: under 10% of total cost.

STT is the cheapest component. Optimizing it rarely moves total cost meaningfully unless you're using premium STT in long-form conversations.

Large language model

Cost: $0.01–$0.04 per minute.

Range: Smaller models (Haiku, GPT-4.1 mini) at the low end; frontier models (Sonnet, GPT-class) at the higher end.

Typical share: 10–25% of total cost.

LLM cost depends heavily on prompt size and conversation history length. A 4000-token system prompt resent on every turn is likely a major hidden cost driver, because it is billed as input tokens each time.

Text-to-speech

Cost: $0.03–$0.10 per minute.

Range: Cheaper TTS (Azure, Google) at the low end; premium TTS (ElevenLabs, Cartesia) at the higher end.

Typical share: 30–50% of total cost.

TTS is the most expensive and most variable component. Choosing the highest-quality voice carries a per-minute premium that scales directly with usage.

Telephony

Cost: $0.015–$0.05 per minute for inbound; more for outbound.

Twilio adds approximately $0.02/min to every call, per industry data, representing 35–50% of total COGS for the cheapest AI stacks.

Typical share: 15–30% of total cost.

Telephony is often the surprise cost. Teams optimize the AI stack and forget that the dialer line item alone can dominate.

Platform / infrastructure

Cost: $0.02–$0.05 per minute for managed platforms; less if you're running everything yourself but with the engineering overhead absorbed elsewhere.

Typical share: 5–20% depending on stack.

Managed platforms (Vapi, Retell, Bland) bundle this in a single per-minute number. Self-hosted stacks distribute it across compute, storage, monitoring, and engineering time.
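The line items above can be added up directly. The sketch below is illustrative only: it copies the per-minute ranges from the component sections (inbound telephony, managed-platform overhead) and assumes, for the share calculation, that every component sits at the midpoint of its range. The function names are hypothetical.

```python
# Per-minute cost ranges (USD) copied from the component sections above.
# Telephony uses the inbound range; platform uses the managed-platform range.
COMPONENTS = {
    "stt": (0.003, 0.02),
    "llm": (0.01, 0.04),
    "tts": (0.03, 0.10),
    "telephony": (0.015, 0.05),
    "platform": (0.02, 0.05),
}


def total_range(components):
    """Sum the low ends and the high ends of every component range."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def shares_at_midpoint(components):
    """Each component's share of total cost, assuming midpoint pricing."""
    mids = {name: (lo + hi) / 2 for name, (lo, hi) in components.items()}
    total = sum(mids.values())
    return {name: mid / total for name, mid in mids.items()}


low, high = total_range(COMPONENTS)
print(f"sum of list prices: ${low:.3f} to ${high:.3f} per minute")
for name, share in shares_at_midpoint(COMPONENTS).items():
    print(f"{name:>10}: {share:.0%}")
```

Under the midpoint assumption TTS comes out as the largest share. The list prices alone sum to less than the all-in headline range; the hidden costs discussed in the next section are one plausible source of the gap.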

Where costs hide

Five places the per-minute number can lie:

1. Idle time

A "5-minute conversation" includes user pauses, hold time, music-on-hold, and dead air. Some components bill on wall-clock time (telephony) and some on processing time (STT), so in sparse conversations the components billed on wall-clock time dominate the cost.
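A rough way to see the idle-time effect is to divide total call cost by the minutes of actual speech. The sketch below assumes a simple two-bucket model, one rate billed on wall-clock minutes and one on processed-audio minutes; the rates and the function name are hypothetical, not vendor quotes.

```python
def cost_per_talk_minute(wall_minutes, talk_minutes, wall_rate, processing_rate):
    """Total call cost divided by minutes of actual speech.

    wall_rate applies to every minute the line is open (e.g. telephony);
    processing_rate applies only to minutes of audio actually processed.
    """
    if talk_minutes <= 0 or wall_minutes < talk_minutes:
        raise ValueError("need 0 < talk_minutes <= wall_minutes")
    total = wall_minutes * wall_rate + talk_minutes * processing_rate
    return total / talk_minutes


# Hypothetical rates: $0.02/min wall-clock, $0.01/min of processed audio.
dense = cost_per_talk_minute(5, 5, wall_rate=0.02, processing_rate=0.01)
sparse = cost_per_talk_minute(5, 2.5, wall_rate=0.02, processing_rate=0.01)
print(f"dense call:  ${dense:.3f} per talk minute")
print(f"sparse call: ${sparse:.3f} per talk minute")
```

With the same 5-minute call, halving the amount of speech raises the effective cost per talk minute, and all of the increase comes from the wall-clock term.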

2. Long context windows

LLM cost grows with conversation history, because the full history is resent as input on each turn. A 30-turn conversation with 200 tokens per turn plus a 1500-token system prompt reaches roughly 7500 input tokens per request by the final turn (1500 + 30 × 200). At even cheap rates, that's a meaningful per-minute multiplier.
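The growth is easy to tabulate. This sketch assumes the simplest possible model: each request carries the system prompt plus every turn so far at a flat token count, with no caching or truncation. The function names are illustrative.

```python
def input_tokens_at_turn(turn, tokens_per_turn=200, system_prompt_tokens=1500):
    """Input tokens sent on a 1-indexed turn: system prompt plus all turns so far."""
    return system_prompt_tokens + turn * tokens_per_turn


def cumulative_input_tokens(turns, tokens_per_turn=200, system_prompt_tokens=1500):
    """Total input tokens billed across the whole conversation."""
    return sum(
        input_tokens_at_turn(t, tokens_per_turn, system_prompt_tokens)
        for t in range(1, turns + 1)
    )


for turn in (1, 10, 30):
    print(f"turn {turn:>2}: {input_tokens_at_turn(turn):,} input tokens")
print(f"whole conversation: {cumulative_input_tokens(30):,} input tokens")
```

The per-request figure grows linearly with turn count, so the cumulative total grows roughly quadratically, which is why long calls cost disproportionately more than short ones.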

3. Tool calls

Every tool call is an extra LLM round trip — sometimes two. For agents that look up a lot of state (calendar, CRM, knowledge base), tool-call overhead can be the second-largest LLM cost driver after history.

4. Streaming inefficiency

Streaming STT and TTS are billed per second of audio. Jittery, restarting streams (from network blips or aggressive interruption handling) can increase billed seconds beyond actual conversation length.

5. Failed call retries

Calls that fail mid-flight, get retried, or hand off mid-conversation can incur double billing — once for the failed attempt and once for the retry.

Scaling math

A worked example for a midsize deployment:

50,000 conversations/month, average 3 minutes per conversation.

Total: 150,000 voice minutes/month.

At $0.20/min all-in, total cost is $30,000/month or roughly $360,000/year.

At $0.30/min, that becomes $45,000/month or $540,000/year. The 50% delta between low-end and mid-tier pricing is real money.

Compare to human operations: assuming roughly $1.50/min loaded for US agents (an estimate, not a sourced figure), the same volume costs $225,000/month. Even with all of AI's hidden costs and overhead, the AI stack is roughly 5–7.5x cheaper at the $0.20–$0.30/min rates above.
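The worked example reduces to one multiplication, which makes it easy to rerun with your own volume and rates. A minimal sketch, using the article's figures as inputs; the helper name is hypothetical.

```python
def monthly_cost(conversations, avg_minutes, rate_per_minute):
    """Monthly spend in USD for a given volume and all-in per-minute rate."""
    return conversations * avg_minutes * rate_per_minute


VOLUME = dict(conversations=50_000, avg_minutes=3)

ai_low = monthly_cost(**VOLUME, rate_per_minute=0.20)
ai_mid = monthly_cost(**VOLUME, rate_per_minute=0.30)
# The loaded human rate is the article's estimate, not a measured figure.
human = monthly_cost(**VOLUME, rate_per_minute=1.50)

for label, cost in [("AI @ $0.20", ai_low), ("AI @ $0.30", ai_mid), ("human @ $1.50", human)]:
    print(f"{label:>14}: ${cost:,.0f}/month, ${cost * 12:,.0f}/year")
print(f"human / AI ratio: {human / ai_mid:.1f}x to {human / ai_low:.1f}x")
```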

Optimizing the per-minute number

Five high-leverage cost optimizations:

Trim system prompts. A 1500-token system prompt costs as much as a 1500-token user message — every turn. Audit ruthlessly.

Summarize long conversations. After 8–10 turns, compress older context to a 200-token summary.

Use a smaller model where possible. A 4B-parameter model that solves your tier-1 use cases at $0.005/min total LLM cost is dramatically cheaper than a frontier model.

Choose TTS by use case, not by demo quality. Cheaper TTS for routine flows; premium TTS for high-impact moments only.

Watch the telephony line. SIP trunking, regional pricing, and wholesale rates can lower telephony cost meaningfully at scale.
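To get a feel for what summarization buys, the sketch below compares per-request input tokens with and without it. It assumes one particular policy that the article does not specify: the most recent 8 turns stay verbatim and everything older collapses into a single 200-token summary. It also ignores the cost of the summarization call itself. All names and defaults are illustrative.

```python
def tokens_without_summary(turn, tokens_per_turn=200, system_prompt=1500):
    """Input tokens on a 1-indexed turn when the full history is resent."""
    return system_prompt + turn * tokens_per_turn


def tokens_with_summary(turn, keep_recent=8, summary_tokens=200,
                        tokens_per_turn=200, system_prompt=1500):
    """Input tokens when turns older than keep_recent are replaced by a summary."""
    if turn <= keep_recent:
        # Nothing is old enough to summarize yet.
        return system_prompt + turn * tokens_per_turn
    return system_prompt + summary_tokens + keep_recent * tokens_per_turn


for turn in (5, 10, 30):
    full = tokens_without_summary(turn)
    trimmed = tokens_with_summary(turn)
    print(f"turn {turn:>2}: {full:,} tokens full history, {trimmed:,} with summary")
```

Under this policy the per-request input size stops growing once the conversation passes the verbatim window, at which point the system prompt becomes the largest remaining term and trimming it pays off on every turn.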

Pricing your product

If you're charging customers for voice minutes, a typical 2026 markup appears to be 2–4x cost. A $0.20/min cost charged at $0.40–$0.80/min reads as competitive against legacy IVR costs and gives you margin to absorb support, sales, and product investment.

Below 2x, you're squeezing margin to where any one bad month wipes you out. Above 4x, customers are doing the math themselves and migrating to in-house deployments.
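Markup and gross margin are two views of the same number, and converting between them makes the 2x and 4x boundaries more concrete. A minimal sketch, assuming price is simply cost times markup with no volume discounts; the function name is hypothetical.

```python
def price_and_margin(cost_per_minute, markup):
    """Return (price per minute, gross margin as a fraction of price)."""
    if cost_per_minute <= 0 or markup <= 0:
        raise ValueError("cost and markup must be positive")
    price = cost_per_minute * markup
    gross_margin = (price - cost_per_minute) / price
    return price, gross_margin


for markup in (2, 3, 4):
    price, margin = price_and_margin(0.20, markup)
    print(f"{markup}x markup: ${price:.2f}/min, {margin:.0%} gross margin")
```

Gross margin here is 1 − 1/markup, so it does not depend on the underlying cost: a 2x markup leaves half of each billed minute to cover everything outside COGS, and a 4x markup leaves three quarters.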

The bottom line

A voice minute in 2026 costs $0.12–$0.45 to deliver, dominated by TTS and telephony, with LLM as a meaningful third and STT as a rounding error. The hidden costs — long context, tool calls, idle time, retries — can swing the number 30–50% if you're not watching. Optimize the right line items (TTS choice, prompt size, telephony rate) and the unit economics are durable. Optimize the wrong ones (squeezing 5% out of STT) and you're rearranging the deck chairs.

Frequently Asked Questions

Why is TTS the most expensive component?

Premium TTS providers (ElevenLabs, Cartesia, Hume) charge a premium because their voice quality is the user-perceived differentiator. Generic TTS is cheap; the voices users actually want are not. The tradeoff between cost and quality is the single biggest knob in voice agent economics.

Are managed platforms (Vapi, Retell) cheaper or more expensive than DIY?

They look more expensive per minute but absorb engineering, observability, and operations overhead. For early-stage teams or when speed-to-market matters, managed wins. For larger deployments where margin matters and you have voice engineering in-house, DIY usually wins.

What's the cheapest possible voice minute in 2026?

Around $0.08/min, using Whisper or a cheaper Deepgram tier for STT, a small open-source LLM, Azure or Google TTS, and a SIP trunk instead of Twilio. The trade-off is quality and developer ergonomics. Going below $0.10/min is achievable, but most teams find the savings don't justify the compromises in experience.