Building Voice Agents on CallMissed: From WebRTC to Sub-Second Round-Trip

CallMissed · May 8, 2026


A voice agent in 2026 is no longer a research demo. It is a real product surface — phone support, scheduling, in-app conversational UIs, embedded copilots — and the difference between one users tolerate and one users enjoy is almost entirely about latency and turn-taking. CallMissed gives you the production plumbing without the months of WebRTC tuning.

What CallMissed actually does

A voice agent has four moving parts:

A real-time audio transport so the user's microphone and the agent's audio actually reach each other with low jitter

A speech-to-text (STT) pass that streams partial transcripts as the user is still talking

A language model (LLM) that decides what to say back

A text-to-speech (TTS) pass that begins streaming audio as soon as the LLM emits its first tokens

CallMissed handles the transport (WebRTC over our hosted media servers) and orchestrates the STT → LLM → TTS pipeline behind one API. You pick the models; we run the room.
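To make the four-stage shape concrete, here is a minimal sketch of a streaming STT → LLM → TTS turn built from Python generators. Every function here is a hypothetical stand-in (no real model or CallMissed API is called); the point is only that each stage can start emitting output before the previous stage has finished.

```python
from typing import Iterable, Iterator, List


def stt_partials(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Stand-in STT: yields a growing partial transcript per audio chunk."""
    words: List[str] = []
    for chunk in audio_chunks:
        words.append(chunk.decode("utf-8"))  # pretend each chunk is one word
        yield " ".join(words)


def llm_tokens(final_transcript: str) -> Iterator[str]:
    """Stand-in LLM: streams a reply one token at a time."""
    for token in ["You", "said:", *final_transcript.split()]:
        yield token


def tts_chunks(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stand-in TTS: emits one audio chunk per token as soon as it arrives."""
    for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


def run_turn(audio_chunks: Iterable[bytes]) -> List[bytes]:
    """Run one conversational turn through the three stand-in stages."""
    transcript = ""
    for transcript in stt_partials(audio_chunks):
        pass  # a real client could render each partial transcript here
    # TTS consumes LLM tokens lazily, so audio starts with the first token.
    return list(tts_chunks(llm_tokens(transcript)))
```

Because `tts_chunks` pulls from `llm_tokens` lazily, the first audio chunk is produced after the first token rather than after the full reply, which is the property the list above describes.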

The session lifecycle

A voice session looks like this:

POST /v1/voice/sessions
→ { session_id, livekit_url, livekit_token, stt_model, llm_model, tts_voice }

Your client connects to the WebRTC room with the returned token. The agent process, running server-side, joins the same room, subscribes to your microphone track, and publishes its synthesized response track back. Audio flows continuously in both directions: the user can interrupt, the agent stops speaking, and the next turn begins.
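The interruption behavior described above is commonly implemented by cancelling the in-flight playback task when user speech is detected. The sketch below illustrates that general pattern with `asyncio`; it is not CallMissed's server code, and the names `speak` and `run_with_barge_in` are hypothetical. A timer stands in for the voice-activity signal.

```python
import asyncio


async def speak(chunks, played):
    """Pretend to play TTS chunks; records each chunk as it is 'played'."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.01)  # simulated playback time per chunk


async def run_with_barge_in(chunks, interrupt_after):
    """Start speaking, then cancel playback when the user 'interrupts'."""
    played = []
    speaking = asyncio.create_task(speak(chunks, played))
    # Stand-in for "VAD detected that the user started talking".
    await asyncio.sleep(interrupt_after)
    speaking.cancel()
    try:
        await speaking
    except asyncio.CancelledError:
        pass  # expected: playback was cut off mid-utterance
    return played
```

Calling `asyncio.run(run_with_barge_in(list(range(100)), 0.03))` returns only the first few chunks, since playback stops as soon as the interrupt fires.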

Sessions are tenant-scoped. Each cm_* API key creates and manages sessions only for its own tenant. Session usage (minutes, model calls, audio bytes) is metered for billing and exposed via /api/v1/analytics.

Where latency actually comes from

Sub-second perceived latency is the goal. Time-to-first-audio (TTFA) is the metric that matters — how long after the user finishes speaking until the agent's voice begins. In 2026, a tight stack looks like:

Endpointing / VAD: 100–200ms (deciding "the user stopped")

STT finalization: 50–150ms (closing out a streaming partial)

LLM time-to-first-token: 200–400ms

TTS first audio chunk: 75–250ms

Network + jitter: 30–80ms

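Adding these up gives roughly 455ms at the low end and 1,080ms at the high end. In practice the stages rarely all hit their extremes at once, so a well-tuned voice agent ships first audio in 600–900ms. CallMissed is configured for that envelope by default. The hot-path discipline (no database writes and no extra work before the first audio chunk is yielded) is enforced server-side, so adding instrumentation does not silently regress latency.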
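As a quick check on the budget, the per-stage ranges listed above can be summed directly. This is plain arithmetic on the figures in this article; the dictionary keys are illustrative labels, not API fields.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
BUDGET_MS = {
    "endpointing_vad": (100, 200),
    "stt_finalization": (50, 150),
    "llm_first_token": (200, 400),
    "tts_first_chunk": (75, 250),
    "network_jitter": (30, 80),
}


def ttfa_range(budget):
    """Return (best_case, worst_case) time-to-first-audio in ms."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def largest_stage(budget):
    """Return the stage with the largest worst-case contribution."""
    return max(budget, key=lambda name: budget[name][1])


print(ttfa_range(BUDGET_MS))     # (455, 1080)
print(largest_stage(BUDGET_MS))  # llm_first_token
```

The LLM's time-to-first-token is the single largest term, so model choice is usually the first lever to pull when TTFA drifts above the 600–900ms envelope.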

Picking models for your use case

CallMissed exposes a curated catalog through a single API call:

GET /api/v1/models?service=llm
GET /api/v1/models?service=stt
GET /api/v1/models?service=tts

For each service we surface only the models we can route to in production. You set the llm_model, stt_model, and tts_voice per session. Common shapes:

Customer support / scheduling: balanced LLM + Indian-language-aware STT + a low-latency TTS voice

Long-form research conversations: larger LLM with extended context + an instructable TTS for tonal control

Outbound voice: a tighter prompt + the lowest-latency voice you can find

We do not lock you into a single backend. If the best STT for your customers is multilingual, pick that; if it is English-only, the catalog has that too.
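One way to keep per-session choices honest is to check them against the catalog before creating a session. The sketch below builds the catalog URLs shown earlier and validates a config against catalog data passed in as plain lists. It assumes, without confirmation from the API, that `tts_voice` values come from the `service=tts` listing and that each catalog response can be reduced to a list of identifiers; the helper names and model IDs in the usage note are hypothetical.

```python
from urllib.parse import urlencode

SERVICES = ("llm", "stt", "tts")


def catalog_url(base_url: str, service: str) -> str:
    """Build the catalog URL for one service (llm, stt, or tts)."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    return f"{base_url}/api/v1/models?{urlencode({'service': service})}"


def check_session_config(config: dict, catalog: dict) -> list:
    """Return the config keys whose value is not in the matching catalog list.

    `catalog` maps service name -> list of identifiers (assumed shape).
    """
    key_to_service = {"llm_model": "llm", "stt_model": "stt", "tts_voice": "tts"}
    return [
        key
        for key, service in key_to_service.items()
        if config.get(key) not in catalog.get(service, [])
    ]
```

For example, with a catalog of `{"llm": ["llm-a"], "stt": ["stt-a"], "tts": ["voice-a"]}`, a config that asks for `"stt_model": "stt-b"` is reported as `["stt_model"]`, so the mismatch surfaces before any session is created.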

Knowledge bases and bots

A voice agent is most useful when it knows things. CallMissed bots are a structured wrapper around a system prompt, a knowledge base, and a model configuration. Attach a knowledge base to a bot, attach the bot to a voice session, and your agent answers from your documents instead of generic web knowledge.

Knowledge bases accept Markdown, PDF, and structured documents. Retrieval runs on each turn, scoped by tenant.
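To illustrate what "retrieval on each turn, scoped by tenant" means in practice, here is a deliberately simple toy: it filters chunks by tenant first and then ranks by word overlap with the query. CallMissed's actual retrieval method is not described in this article, so the scoring, the `Chunk` type, and the `retrieve` function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str


def retrieve(chunks, tenant_id, query, top_k=2):
    """Return up to top_k chunk texts for this tenant, ranked by word overlap."""
    query_terms = set(query.lower().split())
    scored = []
    for chunk in chunks:
        if chunk.tenant_id != tenant_id:
            continue  # tenant isolation: never score another tenant's documents
        overlap = len(query_terms & set(chunk.text.lower().split()))
        if overlap:
            scored.append((overlap, chunk.text))
    # Highest overlap first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

The important design point is the order of operations: the tenant filter runs before any scoring, so a document from another tenant can never appear in the results regardless of how well it matches.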

What to build first

The fastest way from zero to working voice agent on CallMissed:

Generate a cm_* API key with the voice scope

Create a bot with your system prompt and (optional) knowledge base

POST /v1/voice/sessions with the bot ID

Drop the LiveKit client SDK into your frontend and join the room

That is roughly 50 lines of frontend code and one server call. The hard parts — interruption handling, partial transcript stability, audio normalization — are already configured.
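The single server call might be sketched as follows. Only the path `/v1/voice/sessions` and the response fields `session_id`, `livekit_url`, and `livekit_token` come from this article; the host, the `Bearer` authorization scheme, and the `bot_id` field name are assumptions, so check the API reference before relying on them. The request is built but not sent, which keeps the example runnable offline.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid"  # placeholder host, not a real endpoint


def build_session_request(api_key: str, bot_id: str) -> urllib.request.Request:
    """Build (but do not send) the session-creation request."""
    body = json.dumps({"bot_id": bot_id}).encode("utf-8")  # field name assumed
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/voice/sessions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme assumed
            "Content-Type": "application/json",
        },
    )


def parse_session_response(raw: bytes) -> dict:
    """Parse the response and verify the fields the client needs to join."""
    session = json.loads(raw)
    required = {"session_id", "livekit_url", "livekit_token"}
    missing = required - session.keys()
    if missing:
        raise ValueError(f"session response missing fields: {sorted(missing)}")
    return session
```

In a real integration you would pass the request to `urllib.request.urlopen`, feed the response body to `parse_session_response`, and hand `livekit_url` and `livekit_token` to the LiveKit client SDK in your frontend.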

What we do not do

Two things are explicitly not CallMissed's job:

We do not write the conversation logic for you. System prompts, tool wiring, and turn flow are yours to design.

We do not own your audio. Recordings are session-scoped and tenant-isolated; export endpoints exist if you want them, deletion endpoints if you do not.

The next step

If you are evaluating the voice stack, the playground at /playground lets you spin up a session against your tenant in under a minute. From there, the API surface is small enough to integrate end-to-end in an afternoon.
