Sarvam Saaras V3: Why India's STT Beats Global Models

CallMissed

For most of the last decade, building voice products in Indian languages meant accepting that STT accuracy would be 30–50% worse than what English-language users took for granted. Code-mixing, accent variation, and 22 official languages with very different scripts conspired against the global ASR vendors. In early 2026, Sarvam AI's Saaras V3 closed most of that gap — and on Indian-specific benchmarks, opened a new one in the other direction.

What Saaras V3 actually claims

According to Sarvam's release blog, Saaras V3 outperforms Gemini 3 Pro, OpenAI's GPT-4o Transcribe, Deepgram Nova-3, and ElevenLabs Scribe v2 on Indian-language and Indian-accented English benchmarks. The headline numbers from Sarvam:

  • 19.31% WER on the IndicVoices benchmark across the 10 most popular Indian languages.
  • 6.37% WER on the Svarah Indian-English benchmark.
  • Native support for streaming speech recognition — partial transcripts emitted while audio is still arriving.
  • Coverage across all 22 scheduled Indian languages plus English.

The model was trained on over a million hours of curated multilingual audio, with a stated focus on low-resource languages and code-mixed speech, per Sarvam's training notes.
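
Both headline figures are word error rates (WER), so it helps to be precise about the metric: the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the reference word count. The Python sketch below is a minimal illustration that assumes plain whitespace tokenization. It does not reproduce whatever text normalization IndicVoices, Svarah, or Sarvam's own evaluation applies, and `word_error_rate` is a name chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Assumption: whitespace tokenization only, no punctuation or case normalization.
reference = "मेरा order अभी तक deliver नहीं हुआ है"
hypothesis = "मेरा order अभी deliver नहीं हुआ है"  # "तक" was dropped
print(word_error_rate(reference, hypothesis))  # 0.125
```

Dropping one word from an eight-word reference gives 1/8 = 12.5% WER. By the same arithmetic, the reported 19.31% corresponds to roughly one word error for every five reference words.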

    Why global models struggle on Indian audio

    Global STT vendors face three structural problems on Indian-language audio:

  • Code-mixing. Indian speakers routinely switch between Hindi, English, and a regional language inside one sentence. Most global models trained on monolingual audio treat the second language as noise.
  • Accent variation. Indian-English alone has more inter-speaker accent variation than most monolingual benchmarks contend with.
  • Long-tail languages. Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Bengali, and Punjabi each have tens of millions of speakers but limited training data in global audio sets. The 12 less-popular scheduled languages have even less.

    Sarvam's training data is curated specifically against these problems. Per Business Standard's coverage, the gap widens further on the 12 lower-resource languages where competing models often produce heavily degraded transcriptions or no output.

    What "code-mixing" looks like in practice

    A typical Hinglish customer-support utterance: "मेरा order अभी तक deliver नहीं हुआ है, can you check the tracking?"

    A monolingual Hindi model sees the English words as out-of-vocabulary noise and either drops them or transcribes them in nonsense Devanagari. A monolingual English model does the inverse. Saaras V3 is trained to keep both languages in their native scripts within the same transcript — which is what downstream NLU models expect.
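
One rough way to check whether a transcript kept each language in its native script is to count tokens by script. The sketch below is a heuristic written for this article, not part of any Sarvam tooling: it recognises only the Devanagari Unicode block (U+0900 to U+097F) and basic ASCII Latin letters, so other Indic scripts land in the "other" bucket. The function names `token_script` and `script_counts` are hypothetical.

```python
from collections import Counter


def token_script(token: str) -> str:
    """Classify a whitespace-delimited token by the scripts of its letters."""
    has_devanagari = any("\u0900" <= ch <= "\u097F" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, or scripts this sketch does not cover


def script_counts(transcript: str) -> dict[str, int]:
    """Count tokens per script label in a transcript."""
    return dict(Counter(token_script(tok) for tok in transcript.split()))


utterance = "मेरा order अभी तक deliver नहीं हुआ है, can you check the tracking?"
print(script_counts(utterance))  # {'devanagari': 6, 'latin': 7}
```

A transcript of this utterance that came back entirely in one script, or with far fewer tokens than expected, would be a signal that one of the two languages was dropped or transliterated.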

    Streaming, finally

    Earlier Saaras versions were batch-oriented. V3 adds native streaming, per Sarvam's API docs. This is the change that makes it usable in real-time voice agents: without streaming, end-of-utterance latency dominates and the user experience falls apart on calls.

    In a streaming pipeline, the model emits partial transcripts as audio arrives, with the final transcript at endpoint. This lets the LLM start "thinking" sooner, similar to how Deepgram's "true streaming" architecture works on the English side, per Deepgram.
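
The partial-then-final pattern can be sketched with a simulated event stream. Everything in this example is an assumption made for illustration: the `TranscriptEvent` shape, the `is_final` flag, and the `fake_stream` generator are hypothetical and do not reflect Sarvam's actual API or message format, which should be taken from Sarvam's documentation.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TranscriptEvent:
    """Hypothetical event shape: growing partial text, then one final result."""
    text: str
    is_final: bool


def fake_stream() -> Iterator[TranscriptEvent]:
    """Stand-in for a real streaming connection; yields canned events."""
    yield TranscriptEvent("मेरा order", is_final=False)
    yield TranscriptEvent("मेरा order अभी तक deliver", is_final=False)
    yield TranscriptEvent("मेरा order अभी तक deliver नहीं हुआ है", is_final=True)


def collect_utterances(events: Iterable[TranscriptEvent]) -> tuple[list[str], int]:
    """Return the finalized utterances and how many partial updates preceded them."""
    finals: list[str] = []
    partial_updates = 0
    for event in events:
        if event.is_final:
            # Only the final transcript is committed to the downstream pipeline.
            finals.append(event.text)
        else:
            # A real agent might refresh a live caption or start preparing
            # the LLM prompt here, before the speaker has finished.
            partial_updates += 1
    return finals, partial_updates


finals, partial_updates = collect_utterances(fake_stream())
print(finals)           # ['मेरा order अभी तक deliver नहीं हुआ है']
print(partial_updates)  # 2
```

The design point is that partials are treated as disposable hints, while only the final transcript is treated as the user's actual utterance.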

    Where Saaras V3 fits in a stack

    Use cases where Saaras V3 is the right pick in 2026:

  • Indian-market voice agents. Customer support, banking, healthcare, e-commerce — anywhere callers may speak Hindi, English, or any combination.
  • Indian-language transcription. Legal, government, media transcription where script-correct output matters.
  • Multilingual call analytics. Conversations recorded for QA where the system needs to handle code-switched dialogue without losing entire utterances.

    Use cases where it is not the right pick:

  • Pure English content from non-Indian speakers — English-first models like Deepgram Nova-3 hold the edge on standard accent benchmarks.
  • Languages outside Indian-language scope — Sarvam doesn't claim coverage for European or East Asian languages.

    How it compares vs. global models on English

    [Inference] On native-English audio, Saaras V3 is competitive but not category-leading. Deepgram Nova-3 reports a median WER of 6.84% on real-time English streams, per Deepgram. The 6.37% Svarah figure for Saaras V3 is on Indian-English specifically, which is harder than US-English. Direct apples-to-apples comparison on the same English-only benchmark sets is hard to find. [Unverified]

    Things to watch when evaluating

    If you're testing Saaras V3 against your own audio:

  • Test on real audio, not benchmark sets. Studio-clean audio benchmarks rarely reflect call-center quality.
  • Evaluate code-mixed utterances explicitly. This is where Sarvam's edge is largest.
  • Measure streaming latency end-to-end, including network and not just model inference.
  • Sample across all 22 languages if you serve them. Coverage is uneven; the headline 10-language number is a partial picture.
  • Confirm script-correct output — transcribing Hindi in Roman script vs Devanagari changes downstream NLU dramatically.
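
For the latency item in the checklist, one possible approach is to record two client-side timestamps per utterance (when the last audio chunk was sent and when the final transcript arrived) and summarise the differences as percentiles, so that network time is included. The sketch below assumes timestamps in seconds from a monotonic clock such as `time.monotonic()`. The helper names and the sample numbers are invented for illustration and are not measurements of Saaras V3.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def end_of_utterance_latencies(timings: list[tuple[float, float]]) -> list[float]:
    """Each pair is (last_audio_chunk_sent_at, final_transcript_received_at)."""
    return [final_at - audio_end_at for audio_end_at, final_at in timings]


# Made-up client-side timestamps in seconds, for illustration only.
timings = [(10.00, 10.42), (25.10, 25.48), (40.30, 40.81), (55.00, 55.95), (70.20, 70.60)]
latencies = end_of_utterance_latencies(timings)
print(f"p50={percentile(latencies, 50):.2f}s p95={percentile(latencies, 95):.2f}s")
# p50=0.42s p95=0.95s
```

Reporting a tail percentile alongside the median matters for calls, because a single slow turn is what a caller notices.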

    What's next

    Sarvam has signaled an Indus model line as a sovereign multilingual stack and continues to iterate Saaras and Bulbul together as a paired ASR/TTS family. [Speculation] V4 or a successor model is likely within the next 12 months.

    The bottom line

    Saaras V3 is the new default STT for Indian-language voice products in 2026. The benchmark wins are real, the streaming addition makes it production-grade, and the code-mixing handling is the feature you cannot easily replicate by switching global vendors. For Indian-market voice AI, this is the model to test first.

    Frequently Asked Questions

    Does Saaras V3 work for non-Indian languages?
    No. Coverage is the 22 scheduled Indian languages plus English with Indian-accent emphasis. For European or East Asian languages, you'll need a different vendor.
    How does Saaras V3 handle Hinglish vs. pure Hindi or pure English?
    It handles code-mixed Hinglish natively, emitting each word in its appropriate script within a single transcript. Monolingual global models typically treat the second language as noise and degrade.
    Is Saaras V3 streaming or batch?
    V3 added native streaming as of its 2026 release. It emits partial transcripts during audio capture and a final transcript at endpoint, suitable for real-time voice agents.
