OmniDocBench Is Saturated - What Our 1,331-Page Benchmark Reveals About Real OCR Failures


21 Mar 2026, 00:00 Z

This post answers a benchmark question: if top OCR models now score above 94% on OmniDocBench, why do they still fail in production?

The short version

  • OmniDocBench is still useful, but it is no longer enough to choose a production OCR system by itself.
  • Several top OCR models now cluster above 94%, so small leaderboard gaps do not explain real workflow risk.
  • On our 1,331-page scan-heavy benchmark, the same model families still produced practical failures: invented chemistry text, spaced-out words, broken tables, and missed blank pages.
  • The lesson is simple: a high public benchmark score does not tell you how the model fails on your documents.

The one-minute decision path

Use public benchmarks as a starting filter, then test the failure modes that matter to your corpus.

Public benchmark tells you... | Your own benchmark must still test...
whether a model is broadly capable | whether it fails on your page types
text and table performance on benchmark pages | blank pages, scan artifacts, diagrams, and noisy tables
relative leaderboard position | cleanup cost and production failure modes
whether a model deserves a shortlist slot | whether it should be routed, rejected, or gated by review

That is why this post moves from OmniDocBench saturation to our scan-heavy benchmark, then to routing.

Quick definitions:

  • OmniDocBench is a public benchmark for document OCR and layout parsing.
  • Saturation means many models score so close together that the benchmark no longer separates them clearly.
  • CER means character error rate. Lower is better.
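
To make the CER definition concrete, here is a minimal sketch in Python. It assumes the common formulation: character-level edit distance (Levenshtein) divided by the length of the reference text. The function names, the sample strings, and the handling of an empty reference are illustrative assumptions, not the scoring code used by OmniDocBench or by our benchmark.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        # Assumption: an empty reference (for example a blank page) scores
        # 0.0 when the OCR output is also empty and 1.0 otherwise.
        # Other conventions exist.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example of the "spaced-out words" failure mode.
    print(cer("benchmark", "benchmark"))          # identical text
    print(cer("benchmark", "b e n c h m a r k"))  # eight inserted spaces
```

Because the denominator is the reference length, CER can exceed 1.0 when a model emits far more text than the page contains, which is one reason invented text is costly under this metric.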
