Sarvam Vision Just Beat Google Gemini and ChatGPT at Document OCR

Snehasish Konger

Founder & CEO



A 3-billion-parameter model from a Bengaluru startup just outscored Google Gemini 3 Pro, DeepSeek OCR v2, and OpenAI's GPT 5.2 on document OCR benchmarks. That's not a headline you read every day — especially when the model in question runs at a fraction of the size of its competitors.

This is Sarvam Vision. And the numbers are real.

What Is Sarvam Vision?

Sarvam AI, founded in 2023 and backed by the India AI Mission, launched Sarvam Vision on February 5, 2026. It is a 3-billion-parameter vision-language model (VLM) built on the Sarvam sovereign 3B base — a state-space architecture, not the transformer-decoder stack most global labs default to.

The model handles a range of visual understanding tasks: image captioning, scene text recognition, chart interpretation, and complex table parsing. Its primary design focus, though, is high-accuracy document intelligence — specifically for Indian-language documents, mixed-script layouts, and scanned archives.

The architecture consists of three components: the sovereign VLM itself, a semantic layout parser, and a reading order network. The latter two are harness modules that run alongside the VLM to interpret document structure, not just text content.

The Benchmark Numbers

Let's look at what Sarvam Vision actually scored before drawing any conclusions.

olmOCR-Bench (English-Only Subset)

olmOCR-Bench runs pass-fail unit tests on document-level OCR tasks. The tests are deterministically machine-verifiable — no subjective scoring. Sarvam evaluated 1,258 filtered samples from the 1,403-sample dataset, restricting to English documents only.
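To make "deterministically machine-verifiable" concrete, here is a hypothetical sketch of what a pass-fail OCR unit test and a category score could look like. The function names and the containment check are illustrative assumptions, not olmOCR-Bench's actual harness or test format.

```python
# Hypothetical illustration of a deterministic, machine-verifiable OCR unit
# test in the spirit of olmOCR-Bench (not its actual harness).

def text_present(ocr_output: str, expected_span: str) -> bool:
    """Pass iff the expected span appears verbatim in the OCR output,
    after collapsing whitespace so line-wrapping differences don't matter."""
    normalize = lambda s: " ".join(s.split())
    return normalize(expected_span) in normalize(ocr_output)

def pass_rate(results: list) -> float:
    """Category score: share of unit tests that passed, as a percentage."""
    return 100.0 * sum(results) / len(results)

ocr_output = "The   quick brown\nfox jumps over the lazy dog."
checks = [
    text_present(ocr_output, "quick brown fox"),  # passes
    text_present(ocr_output, "lazy dog."),        # passes
    text_present(ocr_output, "sleeping cat"),     # fails
]
print(f"{pass_rate(checks):.1f}")  # 66.7
```

The point of this structure is that no judge model or human rater sits in the scoring loop: a test either passes or it doesn't.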

| Category | Sarvam Vision | Gemini 3 Pro | GPT 5.2 | DeepSeek OCR v2 | Mistral OCR 3 |
|---|---|---|---|---|---|
| ArXiv Math | 86.5 | 70.6 | 61.0 | 81.9 | 85.4 |
| Base | 99.6 | 99.8 | 99.8 | 99.8 | 99.9 |
| Header/Footer | 96.3 | 84.0 | 75.6 | 95.6 | 93.8 |
| Tiny Text | 91.0 | 90.3 | 62.2 | 88.7 | 88.9 |
| Multi-Column | 82.2 | 79.2 | 70.2 | 83.6 | 82.1 |
| Old Scans | 49.8 | 47.5 | 34.6 | 33.7 | 48.8 |
| Old Math | 81.0 | 84.9 | 75.8 | 68.8 | 68.3 |
| Tables | 88.3 | 84.9 | 79.0 | 78.1 | 86.1 |

Overall: Sarvam Vision — 84.3% | Gemini 3 Pro — 80.2% | GPT 5.2 — 69.8%

Sarvam Vision leads in five of eight categories. It trails on Base documents (by 0.3 points, behind Mistral OCR 3), Multi-Column (by 1.4 points, behind DeepSeek OCR v2), and Old Math (by 3.9 points, behind Gemini 3 Pro). Its strongest advantage comes in ArXiv Math (+15.9 points over Gemini 3 Pro) and Tables (+3.4 points).

OmniDocBench V1.5

This benchmark evaluates document parsing across varied layout types: academic papers, financial reports, and handwritten notes. Evaluated on the official English-only split of 628 samples, Sarvam Vision achieved 93.28% accuracy — reportedly leading this benchmark as well.

The Indic OCR Bench — The More Significant Story

The global benchmarks are interesting. The Sarvam Indic OCR Bench is where the real story lives.

Sarvam built this benchmark specifically because no Indic-standard equivalent of olmOCR-Bench existed. The dataset contains 20,267 samples across all 22 officially scheduled Indian languages, sourced from documents spanning 1800 to present. Scan quality varies deliberately. Samples are curated at the semantic block level, and accuracy is measured as word accuracy: 100 × (1 − WER).
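The word-accuracy metric above can be sketched in a few lines. This is an illustrative implementation that assumes whitespace tokenization and a word-level Levenshtein distance; the benchmark's exact normalization rules are not specified here.

```python
# Word accuracy = 100 * (1 - WER), with WER computed as word-level
# Levenshtein distance divided by the reference word count. (Sketch only;
# tokenization and normalization details are assumptions.)

def word_edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance over words: substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution / match
        prev = curr
    return prev[-1]

def word_accuracy(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    wer = word_edit_distance(ref, hyp) / len(ref)
    return 100.0 * (1.0 - wer)

print(word_accuracy("नमस्ते यह एक परीक्षण वाक्य है",
                    "नमस्ते यह एक परिक्षण वाक्य है"))  # one substitution in six words, ~83.3
```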

Here's what happens to global models when scripts shift away from Latin characters:

| Language | Sarvam Vision | Gemini 3 Pro | Google Cloud Vision | GPT 5.2 |
|---|---|---|---|---|
| Hindi | 95.91 | 95.12 | 90.94 | 84.86 |
| Bengali | 92.61 | 90.79 | 88.23 | 70.52 |
| Tamil | 93.42 | 92.73 | 89.69 | 61.87 |
| Telugu | 87.70 | 85.32 | 82.58 | — |
GPT 5.2 drops to 61.87% on Tamil. For less-resourced Indic scripts — Odia, Kashmiri, Maithili — the gap widens further. Sarvam Vision remains stable across all 22 languages.

Think about the engineering implications: a 3B model trained on India-specific data maintains higher accuracy across 22 scripts than models 10x its size trained on global corpora. That says more about data quality than model size.

How Did a 3B Model Beat Systems Many Times Larger?

This is the right question to ask. Three factors explain it.

Domain-specific data curation. Sarvam's training dataset pulled from scientific literature, financial documents, government bulletins, historical manuscripts, textbooks, magazines, and newspapers — all covering Indian-language content. For each domain, they generated data specific to the task type. Chart-text pairs focused on structured extraction and analysis. Table-parsing data prioritised structure and relationship recognition of cells. This is not general pretraining data with Indic documents mixed in — it's purpose-built.

Continual pretraining + SFT + RLVR. The training pipeline ran continual pretraining on the Sarvam sovereign 3B base, followed by supervised fine-tuning, then reinforcement learning with verifiable rewards (RLVR). RLVR uses reward signals tied to deterministically checkable outputs — the same philosophy behind olmOCR-Bench's pass-fail structure. The model learns from verifiable correctness, not human preference scores.
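As a minimal sketch of what a verifiable reward can look like for OCR: reward comes from a deterministic check against ground truth, not from a learned preference model. The exact-match rule and the partial-credit shaping below are assumptions for illustration; Sarvam's actual reward design is not described here.

```python
# Sketch of an RLVR-style verifiable reward for an OCR rollout: 1.0 for an
# exact (whitespace-normalized) match, otherwise order-respecting partial
# credit via longest common subsequence. Reward shaping is an assumption.

def verifiable_reward(predicted: str, ground_truth: str) -> float:
    pred, gold = predicted.split(), ground_truth.split()
    if pred == gold:
        return 1.0
    # LCS over words: credit only ground-truth words recovered in order
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if p == g else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1] / len(gold)
```

Because the reward is a pure function of the model output and the reference, every reward signal is reproducible — the same property that makes olmOCR-Bench's pass-fail scoring auditable.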

State-space architecture for inference efficiency. State-space models (SSMs) process sequences with linear time complexity rather than the quadratic complexity of attention-based transformers. For long documents with dense layouts, this translates directly to faster inference per token at lower memory cost. Running a 3B SSM in production is a different economics calculation than running a 70B transformer.
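The linear-time claim follows from the recurrence an SSM computes: one fixed-cost hidden-state update per token, with no pairwise token interactions. Below is a toy, pure-Python sketch of a generic linear state-space recurrence — not Sarvam's actual architecture.

```python
# Toy linear SSM scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# Each token costs a fixed amount of work (state size is constant), so the
# whole sequence is O(n) in length n — unlike O(n^2) full self-attention.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def ssm_scan(xs, A, B, C):
    """Single left-to-right pass over the sequence xs."""
    h = [0.0] * len(A)
    ys = []
    for x_t in xs:
        h = vadd(matvec(A, h), matvec(B, x_t))  # fixed-size state update
        ys.append(matvec(C, h))                 # readout
    return ys

# Scalar example: A=0.5, B=1, C=2 on a constant input stream.
print(ssm_scan([[1.0], [1.0], [1.0]], [[0.5]], [[1.0]], [[2.0]]))
# [[2.0], [3.0], [3.5]]
```

For a dense multi-page document, that per-token cost difference compounds: the memory footprint of the scan stays constant in sequence length, while an attention cache grows with it.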

What Developers Should Pay Attention To

If you build document processing pipelines, a few things here are practically relevant.

Benchmark scope matters. olmOCR-Bench filtered to English-only documents for this evaluation. Sarvam Vision's edge on global benchmarks reflects strong general document understanding — but that edge is far larger on Indic content. If your pipeline processes Indian-language documents, the accuracy delta is not marginal. It's the difference between a usable extraction and a broken one.

Model size is not a proxy for capability. The assumption that larger models produce better outputs for specialized tasks does not hold here. A 3B model, trained on the right data with the right objective, outperforms 70B+ systems on a specific domain. This has direct infrastructure cost implications. Sarvam Vision's inference cost runs significantly lower than running frontier-size models through an API.

The benchmark trust question. Adithya S. Kolavi, founder of CognitiveLab and creator of the Indic LLM Leaderboard, tested Sarvam Vision independently and confirmed the performance level. Debarghya Das, Partner at Menlo Ventures, publicly walked back earlier criticism of Sarvam's Indic-language focus. Independent validation from technically credible sources is a meaningful signal. That said, Sarvam both trained the model and created the Indic OCR Bench — independent third-party replication of those Indic benchmark results matters for full confidence.

The Broader Context: Sovereign AI Infrastructure

Sarvam Vision is part of a larger programme. India's AI Mission selected Sarvam AI as the first startup to build India's foundational LLM — a planned 120-billion-parameter model trained on over 17 trillion tokens, with 15–20% of training data originating from India. Current open-source models include less than 1% Indian-origin data.

The India AI Impact Summit, held on February 16, 2026, in New Delhi, positioned Sarvam Vision alongside BharatGen's Param 2 (a 17B-parameter MoE model trained for 22 Indian languages) as evidence that India's AI ecosystem has moved from experimentation to production-grade output.

This is not just national-pride positioning. From a systems perspective, a sovereign AI infrastructure means lower latency for regional deployments, data residency compliance for regulated industries, and models that treat Indic language understanding as a first-class problem rather than a fine-tuning afterthought.

Where the Gaps Still Exist

Accuracy on Old Scans sits at 49.8% for Sarvam Vision. That's the best result among the models tested in this category — but it means roughly half of the old-scan test cases still fail. For digitization pipelines handling archival content, that error rate requires human review at scale.

Multi-column layouts show the tightest competition: Sarvam Vision scores 82.2% versus DeepSeek OCR v2's 83.6%. That's a category worth watching.

Adithya Kolavi stated the broader challenge plainly: independent governance, shared datasets, and transparent evaluation protocols across the Indic AI space remain unresolved. Sarvam built its own benchmark to fill a gap — which is valuable — but industry-wide reproducibility requires open, standardised evaluation frameworks that no single company controls.

The Signal for the OCR and Document Intelligence Space

What should you take from this?

General-purpose frontier models are not automatically the right tool for domain-specific document extraction. For Indic document pipelines, Sarvam Vision presents a credible production option — 3B parameters, strong benchmark performance, API access available through Sarvam's developer dashboard, and explicit coverage of all 22 Indian languages.

For English-only document processing, the picture is more competitive. Mistral OCR 3 scores 85.4% overall on olmOCR-Bench versus Sarvam Vision's 84.3%. The gap is narrow. Your choice between them should come down to specific document types and infrastructure constraints, not the headline score.

The larger point: a focused model trained on the right data consistently beats a general-purpose model on specialized tasks. You've seen this pattern in code generation, legal document analysis, and biomedical NLP. Document OCR for regional scripts follows the same logic.

Conclusion

Sarvam Vision achieving 84.3% on olmOCR-Bench and 93.28% on OmniDocBench V1.5 — while outperforming Gemini 3 Pro, GPT 5.2, and DeepSeek OCR v2 — is a result worth examining carefully. The model is 3 billion parameters. The competitors it beat run at significantly larger scales.

The Indic OCR Bench results tell an even clearer story: for regional-script document understanding, global models leave substantial accuracy on the table. Sarvam Vision does not.

For developers building document extraction systems that touch Indian-language content — government forms, financial filings, legal documents, historical archives — this benchmark performance translates directly to fewer post-processing corrections and higher throughput. That's the practical outcome.

Does your current OCR pipeline degrade meaningfully on non-Latin scripts? If yes, these benchmark numbers should prompt a re-evaluation.
