OmniDocBench Is Saturated - What Our 1,331-Page Benchmark Reveals About Real OCR Failures

Download printable cheat-sheet (CC-BY 4.0)

21 Mar 2026, 00:00 Z

TL;DR - OmniDocBench is saturating. GLM-OCR scores 94.6%, PaddleOCR-VL hits 94.5%, Hunyuan reaches 94.1%. Three models above 94% on a 1,355-page benchmark - and yet every one of them breaks on real scanned documents. Our 1,331-page benchmark on scan-heavy chemistry PDFs tells a different story: hallucinated chemical dosages, spaced-letter artifacts, collapsed table structures, and models that cannot detect a blank page. The gap between benchmark performance and production reliability is not closing. It is hiding.

The saturation problem

In March 2026, LlamaIndex's Jerry Liu flagged what many practitioners had already noticed: OmniDocBench is saturating. The top-ranked open OCR models now cluster above 94% accuracy on the benchmark, with less than a percentage point separating the leaders.

Model	Params	OmniDocBench (reported)
GLM-OCR	0.9B	94.62
PaddleOCR-VL-1.5	0.9B	94.50
HunyuanOCR	1B	94.10
FireRed-OCR	2B	92.94
DeepSeek-OCR-2	3B MoE	91.09

When the top three models are within half a point of each other, the benchmark has stopped being a useful discriminator. But the problem runs deeper than score compression.

Artifact type	Weight	Why
Duplicate lines	1x	Common but low-severity; usually a pagination artifact
Spaced-letter artifacts	2x	Breaks search and indexing; invisible to CER
Fake image references	3x	Model hallucinated a reference to an image that does not exist
Repeated suffix patterns	10x	Strong signal of model degeneration or looping

Model	Speed (sec/page)	Blank detection (of 3)	Strength	Main risk
FireRed-OCR	~3.4	Partial	Best balanced - lowest cleanup burden on text-first pages	Loses question-local visuals on diagram-dependent pages
GLM-OCR	~0.9	0/3	Fastest - best throughput for high-volume workflows	Noisier Markdown, blind to blank pages
HunyuanOCR	~6.6	Partial	Strongest grounded output - 1,517 visual anchors with coordinates	Slowest; high latency for interactive use
DeepSeek-OCR-2	~4.2	3/3	Best blank-page handling; second-strongest grounded workflow	Moderate speed, smaller coordinate vocabulary
dots.ocr-1.5	~3.8	Partial	Broadest scope - handles web, scene, and SVG-style content	Not the safest default for scan-heavy document OCR

OmniDocBench Is Saturated - What Our 1,331-Page Benchmark Reveals About Real OCR Failures

The saturation problem

Turn AI video into a repeatable engine

What OmniDocBench actually measures

What our benchmark tests that OmniDocBench does not

Blank-page blindness

Spaced-letter artifacts

Fused and bunched-up tokens

Chemical equation hallucination

Table structure collapse

Diagram-dependent pages

The evaluation methodology gap

Text artifact scoring

Coordinate-aware anchor matching

Per-archetype aggregation

Blank-page detection as a first-class metric

What the five-model benchmark showed

Why routing beats picking a winner

What needs to change in OCR evaluation

Semantic evaluation over exact-match

Domain-specific benchmark slices

Hallucination-specific metrics

Cost and latency alongside accuracy

Production failure modes as first-class test cases

The bottom line

Related Posts

The saturation problem

Turn AI video into a repeatable engine

What OmniDocBench actually measures

What our benchmark tests that OmniDocBench does not

Blank-page blindness

Spaced-letter artifacts

Fused and bunched-up tokens

Chemical equation hallucination

Table structure collapse

Diagram-dependent pages

The evaluation methodology gap

Text artifact scoring

Coordinate-aware anchor matching

Per-archetype aggregation

Blank-page detection as a first-class metric

What the five-model benchmark showed

Why routing beats picking a winner

What needs to change in OCR evaluation

Semantic evaluation over exact-match

Domain-specific benchmark slices

Hallucination-specific metrics

Cost and latency alongside accuracy

Production failure modes as first-class test cases

The bottom line

Related Posts

Running OpenAI Privacy Filter on an M2 MacBook Pro - 52-Case Benchmark

How Open-Source TTS Architectures Differ - And What It Means for Fine-Tuning (2026)

Build an AI YouTube Shorts Pipeline - Remotion + TTS + Automated Publishing