AI, voice agents & platform engineering

Load Balancing AI Workloads: Routing Across Providers

LLM providers go down. Rate limits hit. Regional latency spikes. New models ship and old models deprecate. By 2026 most production AI systems have stopped pretending a single provider is enough — the question has shifted from "which provider?" to "how do I route across providers reliably?" Why route…

Rate Limiting AI APIs: Strategies That Actually Work

Rate limiting an AI API is harder than rate limiting a regular API. A "request" can cost $0.0001 or $5.00 depending on prompt size, model, and output length. A noisy tenant can starve a paying tenant. An agent loop can fire 100 model calls per user action. The "100 requests per minute" rules from RE…

LoRA and Distillation: A Practical Guide for 2026

In 2026, a single consumer GPU is enough to specialize a 7B model on your domain in an afternoon. That is not a research milestone — it is the default. The two techniques that made it possible are LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA), with distillation as the cousin that compresses …

The Complete 2026 Startup Credits Stack: Over $1M in Free Cloud, AI, and SaaS

If you are starting a company in 2026, the single biggest line item you can wipe off your runway is also the easiest one to apply for. Between cloud providers, AI labs, and SaaS vendors, a well-stacked startup can pull in well over $1M in free credits before paying for a single VM. Most founders lea…

Building Voice Agents on CallMissed: From WebRTC to Sub-Second Round-Trip

A voice agent in 2026 is no longer a research demo. It is a real product surface — phone support, scheduling, in-app conversational UIs, embedded copilots — and the difference between one users tolerate and one users enjoy is almost entirely about latency and turn-taking. CallMissed gives you the pr…

Drop-In OpenAI-Compatible API: Switch Models Without Rewriting Your Code

The OpenAI Chat Completions API has won the LLM API design war. Whether you like the schema or not, every serious SDK and tool now speaks it natively — openai-python, openai-node, the LangChain/LlamaIndex adapters, the Anthropic CLI's compat mode, even some local model runners. CallMissed's /v1/chat…

Anthropic-Compatible Messages API: Use Claude Without Vendor Lock-In

The Anthropic Messages API has its own design — a content-block model, system-prompt-as-top-level-field, native tool use, prompt caching, extended thinking. Apps built on Claude tend to use Anthropic's SDK directly, and migrating those apps usually means rewriting the call shape. CallMissed avoids t…

Multi-Tenant API Keys: Production-Grade Auth with cm_* Tokens

Most AI APIs treat keys as a binary: you have one, or you don't. That works for a hobby project. It does not work when you are deploying agents in production with separate environments, separate teams, separate budgets, and a security review in your future. CallMissed's cm_* API keys are designed for …

Pin Your Models: A Survival Guide for Unstable AI Defaults in Production

OpenAI swapped the default ChatGPT model on May 5, 2026 — GPT-5.5 Instant replaced GPT-5.3 Instant. The change happened in under two weeks. Anything you were testing on the consumer surface the day before may have behaved differently the day after. This is not a one-off. It is the new default cadenc…

How Llama 4's Mixture-of-Experts Architecture Works

Meta's Llama 4 family is the first Llama generation to ship as a Mixture-of-Experts (MoE) architecture. That single design choice explains most of what's different about Scout and Maverick — including why both have "17 billion active parameters" but very different total parameter counts, and why the…

Interruption Handling in Voice Agents: The Hard Problem

The single most common reason voice agents feel "robotic" is not voice quality, latency, or even reasoning quality. It is interruption handling. A human conversation partner stops talking the moment you start. A bad voice agent talks over you, ignores you, or restarts in confusion. Interruption is t…

VAD and Endpointing: Why Your Voice Agent Feels Slow

If your voice agent feels sluggish, the culprit is almost never the LLM. It is endpointing — the silence-detection logic that decides "the user is done speaking, start processing." Most teams over-engineer their LLM stack and under-engineer their VAD and endpointing, then wonder why their pipeline f…

Building Multilingual Voice Agents in 2026

A multilingual voice agent is not a monolingual agent with extra language packs. It is an architectural choice that affects every layer of the stack. In 2026, the teams shipping multilingual voice agents successfully are the ones who treat language as a first-class routing dimension, not an aftertho…

WebRTC for Voice AI: A Practical Primer

WebRTC is the transport that almost every browser-based voice AI runs on. It is also the layer that most application teams treat as a black box until something breaks at 3am. This primer is the minimum viable understanding of WebRTC you need to ship voice agents in 2026 — enough to design well, debu…

Prompt Engineering for Voice Agents

Prompt engineering for voice is not prompt engineering for chat with a TTS bolted on. Voice has different constraints — latency budget, no formatting, interruption tolerance, listener attention span — that change every layer of how you write the prompt. The same prompt that produces excellent respon…

Conversation Design for Voice: From Script to Flow

Conversation design is the discipline that separates voice agents that are pleasant to use from voice agents that win lawsuits. The work happens before code: how should a turn unfold, what does the agent do when things go wrong, what is the persona, where does the conversation actually end. In 2026,…

Evaluating Voice Agents: Beyond Word Error Rate

Word Error Rate is the most-quoted metric in voice AI and the least useful for evaluating actual voice agents. WER measures STT accuracy on transcribed audio. It tells you nothing about whether your agent answered the user's question, finished the task, sounded natural, or kept the conversation aliv…

Building Your First MCP Server: A Step-by-Step Tutorial

The Model Context Protocol (MCP) has gone from an Anthropic side-project announced in late 2024 to the de-facto plumbing for tool-using agents in eighteen months. OpenAI, Google, and most major IDE vendors now speak it natively, and the official spec moved through several revisions in 2025, with a 2…

Agent Memory Architecture: Working, Episodic, Semantic

"Agent memory" is one of the most overloaded terms in the field. People mean radically different things: a chat-history buffer, a vector store of past sessions, a fact graph, or some custom hybrid. This matters because picking the wrong memory shape for the wrong job is the most common reason agents…