Sarvam Saaras V3: Why India's STT Beats Global Models
For most of the last decade, building voice products in Indian languages meant accepting that STT accuracy would be 30–50% worse than what English-language users took for granted. Code-mixing, accent variation, and 22 official languages with very different scripts conspired against the global ASR vendors. In early 2026, Sarvam AI's Saaras V3 closed most of that gap and, on Indian-specific benchmarks, opened a new one in the other direction.
What Saaras V3 actually claims
According to Sarvam's release blog, Saaras V3 outperforms Gemini 3 Pro, OpenAI's GPT-4o Transcribe, Deepgram Nova-3, and ElevenLabs Scribe v2 on Indian-language and Indian-accented English benchmarks. The headline numbers from Sarvam:
The model was trained on over a million hours of curated multilingual audio, with a stated focus on low-resource languages and code-mixed speech, per Sarvam's training notes.
Why global models struggle on Indian audio
Global STT vendors face three structural problems on Indian-language audio:
Sarvam's training data is curated specifically against these problems. Per Business Standard's coverage, the gap widens further on the 12 lower-resource languages where competing models often produce heavily degraded transcriptions or no output.
What "code-mixing" looks like in practice
A typical Hinglish customer-support utterance: "मेरा order अभी तक deliver नहीं हुआ है, can you check the tracking?"
A monolingual Hindi model treats the English words as out-of-vocabulary noise and either drops them or renders them as nonsense Devanagari. A monolingual English model does the inverse. Saaras V3 is trained to keep both languages in their native scripts within the same transcript, which is what downstream NLU models expect.
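One rough way to check whether a transcript kept each language in its native script is to count tokens by Unicode script. The sketch below is only an illustration using Python's standard library; the helper names `token_script` and `script_counts` are made up for this example and are not part of any Sarvam tooling.

```python
import unicodedata
from collections import Counter


def token_script(token: str) -> str:
    """Label a token by the Unicode script of its characters.

    Uses Unicode character names, so Devanagari vowel signs and other
    combining marks are counted along with the base letters.
    Punctuation and digits are ignored.
    """
    scripts = set()
    for ch in token:
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            scripts.add("devanagari")
        elif name.startswith("LATIN"):
            scripts.add("latin")
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed" if scripts else "other"


def script_counts(transcript: str) -> Counter:
    """Count whitespace-separated tokens per script label."""
    return Counter(token_script(tok) for tok in transcript.split())


# The Hinglish utterance from the example above.
utterance = "मेरा order अभी तक deliver नहीं हुआ है, can you check the tracking?"
counts = script_counts(utterance)
print(counts["devanagari"], counts["latin"])  # 6 Devanagari tokens, 7 Latin tokens
```

If a model had transliterated the English words into Devanagari, the Latin count would drop to zero, which makes this a cheap sanity check to run over a batch of transcripts before feeding them to an NLU model.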
Streaming, finally
Earlier Saaras versions were batch-oriented. V3 adds native streaming, per Sarvam's API docs. This is the change that makes it usable in real-time voice agents: without streaming, end-of-utterance latency dominates and the user experience falls apart on calls.
In a streaming pipeline, the model emits partial transcripts as audio arrives and a final transcript once the end of the utterance is detected. This lets the LLM start "thinking" sooner, similar to how Deepgram's "true streaming" architecture works on the English side, per Deepgram.
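The partial-then-final pattern can be sketched without any vendor SDK. The `TranscriptEvent` shape and the `finalized_utterances` helper below are hypothetical, chosen only to illustrate the flow; Sarvam's actual streaming message format is not described here and should be taken from its API docs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class TranscriptEvent:
    """Hypothetical streaming event: interim text or a finalized utterance."""
    text: str
    is_final: bool


def finalized_utterances(
    events: Iterable[TranscriptEvent],
    on_partial: Callable[[str], None],
) -> Iterator[str]:
    """Forward partial transcripts to a callback and yield final ones.

    Partials can drive UI updates or speculative LLM prefetching;
    only final text should be committed to the conversation state.
    """
    for event in events:
        if event.is_final:
            yield event.text
        else:
            on_partial(event.text)


# Simulated event stream standing in for a real streaming connection.
simulated = [
    TranscriptEvent("मेरा order", False),
    TranscriptEvent("मेरा order अभी तक", False),
    TranscriptEvent("मेरा order अभी तक deliver नहीं हुआ है", True),
    TranscriptEvent("can you check", False),
    TranscriptEvent("can you check the tracking?", True),
]

partials: list[str] = []
finals = list(finalized_utterances(simulated, partials.append))
print(len(partials), len(finals))
```

Keeping partials and finals on separate paths is a design choice: partial text may be revised by later events, so anything downstream that is expensive or irreversible is usually safer to trigger from final transcripts only.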
Where Saaras V3 fits in a stack
Use cases where Saaras V3 is the right pick in 2026:
Use cases where it is not the right pick:
How it compares vs. global models on English
[Inference] On native-English audio, Saaras V3 is competitive but not category-leading. Deepgram Nova-3 reports a median WER of 6.84% on real-time English streams, per Deepgram. The 6.37% Svarah figure for Saaras V3 is on Indian-English specifically, which is harder than US-English. Direct apples-to-apples comparison on the same English-only benchmark sets is hard to find. [Unverified]
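For readers comparing WER figures like these, it helps to recall how the metric is usually computed: the word-level edit distance (substitutions, deletions, insertions) between the reference and the model's hypothesis, divided by the number of reference words. The function below is a minimal sketch of that standard definition. It assumes plain whitespace tokenization and no text normalization, and published benchmarks often differ on both, which is one reason numbers from different vendors are hard to compare directly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
print(wer("a b c d", "a x c"))  # 0.5
```

Running the same function over your own reference transcripts for each candidate model gives a like-for-like number, which is more informative than comparing figures reported on different test sets.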
Things to watch when evaluating
If you're testing Saaras V3 against your own audio:
What's next
Sarvam has signaled an Indus model line as a sovereign multilingual stack and continues to iterate Saaras and Bulbul together as a paired ASR/TTS family. [Speculation] V4 or a successor model is likely within the next 12 months.
The bottom line
Saaras V3 is the new default STT for Indian-language voice products in 2026. The benchmark wins are real, the streaming addition makes it production-grade, and the code-mixing handling is the feature you cannot easily replicate by switching global vendors. For Indian-market voice AI, this is the model to test first.