Building Voice Agents on CallMissed: From WebRTC to Sub-Second Round-Trip

CallMissed · 5 min read · Guide

A voice agent in 2026 is no longer a research demo. It is a real product surface — phone support, scheduling, in-app conversational UIs, embedded copilots — and the difference between one users tolerate and one users enjoy is almost entirely about latency and turn-taking. CallMissed gives you the production plumbing without the months of WebRTC tuning.

What CallMissed actually does

A voice agent has four moving parts:

  • A real-time audio transport so the user's microphone and the agent's audio actually reach each other with low jitter
  • A speech-to-text (STT) pass that streams partial transcripts as the user is still talking
  • A language model (LLM) that decides what to say back
  • A text-to-speech (TTS) pass that begins streaming audio as soon as the LLM emits its first tokens

    CallMissed handles the transport (WebRTC over our hosted media servers) and orchestrates the STT → LLM → TTS pipeline behind one API. You pick the models; we run the room.

    The session lifecycle

    A voice session looks like this:

    POST /v1/voice/sessions
    → { session_id, livekit_url, livekit_token, stt_model, llm_model, tts_voice }
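
A minimal sketch of that call from a Python backend, using only the standard library. The base URL, the bearer-token header, and the request body field names are assumptions made for illustration; only the endpoint path and the response fields come from the example above.

```python
import json
import urllib.request

# Hypothetical host; substitute your real CallMissed API base URL.
BASE_URL = "https://api.callmissed.example"

# Response fields a client needs before it can join the room.
REQUIRED_FIELDS = {"session_id", "livekit_url", "livekit_token"}


def build_session_request(api_key, llm_model, stt_model, tts_voice):
    """Build (but do not send) the POST /v1/voice/sessions request."""
    # Assumption: the request body reuses the field names the response returns.
    payload = {"llm_model": llm_model, "stt_model": stt_model, "tts_voice": tts_voice}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/voice/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the cm_* key is sent as a bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_session_response(body):
    """Return (session_id, livekit_url, livekit_token) from a JSON response body."""
    session = json.loads(body)
    missing = REQUIRED_FIELDS - set(session)
    if missing:
        raise ValueError(f"session response missing fields: {sorted(missing)}")
    return session["session_id"], session["livekit_url"], session["livekit_token"]
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and handing the response body to `parse_session_response`. Keeping the builder and parser free of network I/O makes both easy to unit-test.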

    Your client connects to the WebRTC room with the returned token. The agent process, running server-side, joins the same room, subscribes to your microphone track, and publishes its synthesized response track back. Audio flows continuously in both directions: when the user interrupts, the agent stops speaking and the next turn begins.
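
The interrupt-and-resume behavior described above can be pictured as a small turn-taking state machine. This is an illustrative sketch only: the state and event names are hypothetical, and it is not a description of how CallMissed implements interruption server-side.

```python
from enum import Enum


class Turn(Enum):
    LISTENING = "listening"  # user has the floor
    THINKING = "thinking"    # STT finalized, waiting for the first audio chunk
    SPEAKING = "speaking"    # agent audio is playing


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (Turn.LISTENING, "user_stopped"): Turn.THINKING,
    (Turn.THINKING, "first_audio"): Turn.SPEAKING,
    (Turn.SPEAKING, "playback_done"): Turn.LISTENING,
    # Barge-in: the user starts talking while the agent is busy,
    # so the agent yields and goes back to listening.
    (Turn.THINKING, "user_started"): Turn.LISTENING,
    (Turn.SPEAKING, "user_started"): Turn.LISTENING,
}


def step(state, event):
    """Apply one event; events with no defined transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=Turn.LISTENING):
    """Fold a sequence of events into the final state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["user_stopped", "first_audio", "user_started"])` ends in `Turn.LISTENING`: the agent began speaking, the user interrupted, and the next turn starts.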

    Sessions are tenant-scoped. Each cm_* API key creates and manages sessions only for its own tenant. Session usage (minutes, model calls, audio bytes) is metered for billing and exposed via /api/v1/analytics.

    Where latency actually comes from

    Sub-second perceived latency is the goal. Time-to-first-audio (TTFA) is the metric that matters — how long after the user finishes speaking until the agent's voice begins. In 2026, a tight stack looks like:

  • Endpointing / VAD: 100–200ms (deciding "the user stopped")
  • STT finalization: 50–150ms (closing out a streaming partial)
  • LLM time-to-first-token: 200–400ms
  • TTS first audio chunk: 75–250ms
  • Network + jitter: 30–80ms

    Adding these up: the ranges span 455ms at best to 1,080ms at worst, with midpoints totaling roughly 770ms, so a well-tuned voice agent ships first audio in 600–900ms. CallMissed is configured for that envelope by default. The hot-path discipline (no DB writes, no pre-yield work) is enforced server-side, so adding instrumentation does not silently regress latency.
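
The budget above is simple enough to check by hand, but a short script makes it easy to re-run when you swap a model. The stage names are labels chosen for this sketch; the numbers are the ranges listed above.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
BUDGET_MS = {
    "endpointing_vad": (100, 200),
    "stt_finalization": (50, 150),
    "llm_first_token": (200, 400),
    "tts_first_chunk": (75, 250),
    "network_jitter": (30, 80),
}


def ttfa_bounds(budget):
    """Return (best_case, worst_case) time-to-first-audio in milliseconds."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def ttfa_midpoint(budget):
    """Sum of per-stage midpoints: a rough 'typical' time-to-first-audio."""
    return sum((low + high) / 2 for low, high in budget.values())


def largest_stage(budget):
    """Name the stage with the largest worst-case contribution."""
    return max(budget, key=lambda stage: budget[stage][1])


best, worst = ttfa_bounds(BUDGET_MS)
print(f"TTFA bounds: {best}-{worst} ms")              # TTFA bounds: 455-1080 ms
print(f"TTFA midpoint: {ttfa_midpoint(BUDGET_MS)} ms")  # TTFA midpoint: 767.5 ms
print(f"Largest stage: {largest_stage(BUDGET_MS)}")     # Largest stage: llm_first_token
```

The midpoint lands inside the 600–900ms envelope, and the LLM's time-to-first-token is the largest single contributor, which is usually the first place to look when tuning.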

    Picking models for your use case

    CallMissed exposes a curated catalog through a single API call:

    GET /api/v1/models?service=llm
    GET /api/v1/models?service=stt
    GET /api/v1/models?service=tts

    For each service we surface only the models we can route to in production. You set the llm_model, stt_model, and tts_voice per session. Common shapes:

  • Customer support / scheduling: balanced LLM + Indian-language-aware STT + a low-latency TTS voice
  • Long-form research conversations: larger LLM with extended context + an instructable TTS for tonal control
  • Outbound voice: a tighter prompt + the lowest-latency voice you can find

    We do not lock you into a single backend. If the best STT for your customers is multilingual, pick that; if it is English-only, the catalog has that too.
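
One way to wire the catalog into session setup is to validate your choices against it before creating a session. In this sketch the base URL is hypothetical, and the catalog is assumed to have already been fetched and normalized into a mapping from service name to a set of model IDs, since the response shape is not specified here.

```python
from urllib.parse import urlencode

# Hypothetical host; substitute your real CallMissed API base URL.
BASE_URL = "https://api.callmissed.example"

# Catalog service name -> per-session field it populates.
SESSION_FIELD = {"llm": "llm_model", "stt": "stt_model", "tts": "tts_voice"}


def catalog_url(service):
    """Build the GET /api/v1/models URL for one service."""
    if service not in SESSION_FIELD:
        raise ValueError(f"unknown service: {service!r}")
    return f"{BASE_URL}/api/v1/models?{urlencode({'service': service})}"


def build_session_config(available, choices):
    """Check each chosen ID against the catalog and return per-session settings.

    available: service -> set of model/voice IDs (assumed, pre-normalized shape)
    choices:   service -> the ID you want for this session
    """
    config = {}
    for service, field in SESSION_FIELD.items():
        chosen = choices[service]
        if chosen not in available.get(service, set()):
            raise ValueError(f"{chosen!r} is not in the {service} catalog")
        config[field] = chosen
    return config
```

Failing fast on an ID that is not in the catalog keeps a typo from surfacing later as a session that cannot start.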

    Knowledge bases and bots

    A voice agent is most useful when it knows things. CallMissed bots are a structured wrapper around a system prompt, a knowledge base, and a model configuration. Attach a knowledge base to a bot, attach the bot to a voice session, and your agent answers from your documents instead of generic web knowledge.

    Knowledge bases accept Markdown, PDF, and structured documents. Retrieval runs on each turn, scoped by tenant.

    What to build first

    The fastest way from zero to working voice agent on CallMissed:

  • Generate a cm_* API key with the voice scope
  • Create a bot with your system prompt and (optional) knowledge base
  • POST /v1/voice/sessions with the bot ID
  • Drop the LiveKit client SDK into your frontend and join the room

    That is roughly 50 lines of frontend code and one server call. The hard parts (interruption handling, partial transcript stability, audio normalization) are already configured.

    What we do not do

    Two things are explicitly not CallMissed's job:

  • We do not write the conversation logic for you. System prompts, tool wiring, and turn flow are yours to design.
  • We do not own your audio. Recordings are session-scoped and tenant-isolated; export endpoints exist if you want them, deletion endpoints if you do not.

    The next step

    If you are evaluating the voice stack, the playground at /playground lets you spin up a session against your tenant in under a minute. From there, the API surface is small enough to integrate end-to-end in an afternoon.
