OmniDocBench Is Saturated - What Our 1,331-Page Benchmark Reveals About Real OCR Failures

21 Mar 2026, 00:00 Z

TL;DR - OmniDocBench is saturating. GLM-OCR scores 94.6%, PaddleOCR-VL hits 94.5%, Hunyuan reaches 94.1%. Three models above 94% on a 1,355-page benchmark - and yet every one of them breaks on real scanned documents. Our 1,331-page benchmark on scan-heavy chemistry PDFs tells a different story: hallucinated chemical dosages, spaced-letter artifacts, collapsed table structures, and models that cannot detect a blank page. The gap between benchmark performance and production reliability is not closing. It is hiding.

The saturation problem

In March 2026, LlamaIndex's Jerry Liu flagged what many practitioners had already noticed: OmniDocBench is saturating. The top-ranked open OCR models now cluster above 94% accuracy on the benchmark, with less than a percentage point separating the leaders.

| Model | Params | OmniDocBench (reported) |
|---|---|---|
| GLM-OCR | 0.9B | 94.62 |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 |
| HunyuanOCR | 1B | 94.10 |
| FireRed-OCR | 2B | 92.94 |
| DeepSeek-OCR-2 | 3B MoE | 91.09 |

When the top three models are within half a point of each other, the benchmark has stopped being a useful discriminator. But the problem runs deeper than score compression.
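A back-of-envelope sanity check makes the compression concrete. This sketch assumes, counterfactually, that each benchmark page is an independent pass/fail trial (OmniDocBench's real metric is edit-distance based, so the numbers below are only an order-of-magnitude illustration, and the 1,355-page count is taken from the figures above):

```python
import math

def wald_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% Wald confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Treat each page as an independent pass/fail trial -- an assumption,
# not how OmniDocBench actually scores models.
N_PAGES = 1355
leaders = [
    ("GLM-OCR", 0.9462),
    ("PaddleOCR-VL-1.5", 0.9450),
    ("HunyuanOCR", 0.9410),
]

for model, score in leaders:
    hw = wald_ci_halfwidth(score, N_PAGES)
    print(f"{model}: {score:.2%} +/- {hw:.2%}")
```

Under even this generous independence assumption, each score carries roughly a ±1.2-point uncertainty band, an order of magnitude wider than the 0.12-point gap separating the top two models. The leaderboard ordering is inside the noise.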
