This OCR benchmark leaderboard answers a practical shortlist question: which OCR models deserve your first test in 2026, and what you should check before trusting a public leaderboard.
If you searched for OCR benchmark 2026, OCR leaderboard, or best OCR models 2026, start here. This page is the market map. The workflow page turns that shortlist into a deployment route, and the scan-heavy guide shows the page-type evidence behind the routing calls.
By February 2026, open OCR had become crowded enough that benchmark headlines no longer settled the choice on their own. Several compact vision-language models could already parse documents well. The harder question became where each one breaks.
For model-vs-model comparisons, use the merged workflow guide for GLM OCR vs PaddleOCR, GLM OCR vs Mistral OCR, and dots.ocr-1.5 routing decisions. The old dots comparison URL redirects there because the durable comparison logic now lives in one canonical OCR guide.
What we found
The top reported OCR models are now close enough on headline benchmarks that production fit matters more than tiny score gaps.
GLM-OCR and PaddleOCR-VL-1.5 still belong in the reported OmniDocBench shortlist.
Our hands-on read is more practical: Hunyuan is strongest when coordinates matter, DeepSeek helps when blank-page handling matters, FireRed is the best balanced operational choice, and GLM remains the fastest normal-case workflow.
dots.ocr-1.5 belongs in the OCR plus broader visual parsing lane, not as the default scanned-PDF model.
Use this page to build the first shortlist, then run a fixed page-type bake-off before rollout.
Update (Mar 2026): The public shortlist should now be read with a second layer in mind: our newer full-50 workflow benchmark across Hunyuan, DeepSeek, GLM, and FireRed. That benchmark does not replace the public leaderboard tables below, but it does change the deployment readout: Hunyuan leads on grounded output, DeepSeek is now the second grounded workflow and the strongest blank-page detector, FireRed remains the best balanced workflow, and GLM remains the fastest normal-case path. For the practical routing answer across those workflows plus
OCR has converged with compact VLM design, and in some workflows these models reduce or replace parts of multi-stage OCR pipelines.
Benchmarks increasingly reward document-level understanding, not just line-level text extraction.
Open releases now include practical deployment paths (vLLM, SGLang, Hugging Face, and in some cases Ollama), reducing integration friction.
3 What the benchmark evidence says (reported)
Before comparing model cards, keep three filters in mind:
Compare only like-for-like benchmarks.
Treat low-sample live leaderboard results as directional, not final.
Validate on your own corpus before production promotion.
3.0 Why one OCR leaderboard score is not enough
A single aggregate OCR score usually hides the failure that will cost you time in production.
That is the pattern builders keep running into when they compare OCR models outside a neat leaderboard. In one Reddit benchmark discussion, users pushed for cost-per-success, latency, and open-model comparisons rather than only flagship model accuracy. In a PaddleOCR-VL-1.5 discussion, users reported strong benchmark scores but still called out table failures, repetition, CPU slowness, and local hardware questions. In a scanned PDF extraction thread, the pain was not that OCR returned no text. The pain was that tables broke, columns shifted, numbers were misread, and the output still needed manual checking.
Read leaderboard scores as a shortlist signal, then test the workflow dimensions that actually fail:
OCR text accuracy: can it read the characters?
Table extraction: can it preserve rows, columns, merged cells, and numeric alignment?
Key information extraction: can it pull fields without moving values into the wrong label?
Visual QA: can it answer questions about diagrams, figures, stamps, signatures, and local images?
Long-document handling: does quality hold after many pages, not just one demo page?
Latency and cost: does the model still make sense at 1,000 pages?
Field-level reliability: can a reviewer trace uncertain values back to the page before they enter a downstream system?
That is why this page is a leaderboard and not a final answer. Use it to build the shortlist, then use the workflow guide and scanned-PDF guide to decide which model should handle each page type.
3.1 OmniDocBench snapshot
The table below consolidates reported OmniDocBench scores from model papers/cards, using v1.5 where explicitly stated.
| Model | Params | OmniDocBench (reported) | Notes | Source |
| --- | --- | --- | --- | --- |
| GLM-OCR | 0.9B | 94.62 | Strong all-round reported score; very recent release | |
Official March 2026 release frames it as the strongest end-to-end solution in its comparison slice; structural Markdown focus is the main differentiator
The top reported scores are now close enough that cost, failure mode, and licensing often matter more than a small benchmark gap.
3.5 FireRed-OCR early evidence snapshot
The FireRed-OCR launch matters because it includes both a technical paper and a benchmark framing centered on structural integrity rather than only text recognition.
FireRed-OCR is not the overall reported OmniDocBench leader, but it is now one of the clearest structure-first challengers in the open OCR field.
If your bottleneck is malformed Markdown or broken document syntax rather than pure text recognition, it belongs in the evaluation lane immediately.
3.6 What hands-on evaluation changed
Public benchmark tables are useful, but real scanned documents can still reorder the shortlist once page type and wrapper quality enter the picture. In our scan-heavy pilot, that is exactly what happened. This page should stay the market map and shortlist, not the final routing answer. For the routing rule and the underlying evidence, use the workflow guide and the scan-heavy guide.
3.7 What the newer four-model workflow benchmark changed
The newer full-50 workflow benchmark adds a second layer on top of the public leaderboard story because it compares real operational entrypoints rather than just reported paper/model-card numbers.
| Workflow | Mean sec/page | Blank pages detected | Total visual anchors | Practical readout |
| --- | --- | --- | --- | --- |
| FireRed | 3.328 | 2/3 | 48 | Best balanced workflow |
| GLM | 1.252 | 0/3 | 57 | Fastest normal-case workflow |
| Hunyuan | 6.884 | 2/3 | 1517 | Strongest grounded workflow |
| DeepSeek | 17.591 | 3/3 | 926 | Second grounded workflow; strongest blank handling |
Interpretation:
Hunyuan now has the strongest practical case when grounded structure matters more than speed.
DeepSeek is no longer just a markdown-oriented curiosity. It is now the second grounded workflow in the measured stack, although it is also the slowest.
FireRed remains the best balanced operational choice when you want a cleaner markdown-oriented workflow.
GLM remains the fastest typical path, but it is still weak on blank-page handling.
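To make the speed gap concrete, the mean sec/page figures from the workflow table can be projected onto the 1,000-page question raised earlier. This is a rough sketch that assumes one sequential worker and linear scaling, which real deployments with batching, warm-up, or parallel workers will not match exactly. The function names are illustrative only.

```python
# Mean seconds per page, copied from the workflow benchmark table.
MEAN_SEC_PER_PAGE = {
    "FireRed": 3.328,
    "GLM": 1.252,
    "Hunyuan": 6.884,
    "DeepSeek": 17.591,
}


def projected_minutes(sec_per_page: float, pages: int = 1000) -> float:
    """Linear projection (assumption): one worker, no batching or warm-up."""
    return sec_per_page * pages / 60.0


def slowdown_vs_fastest(timings: dict) -> dict:
    """Express each workflow's sec/page as a multiple of the fastest one."""
    fastest = min(timings.values())
    return {name: round(sec / fastest, 2) for name, sec in timings.items()}


ratios = slowdown_vs_fastest(MEAN_SEC_PER_PAGE)
for name, sec in sorted(MEAN_SEC_PER_PAGE.items(), key=lambda kv: kv[1]):
    minutes = projected_minutes(sec)
    print(f"{name:<9} {minutes:7.1f} min per 1,000 pages ({ratios[name]}x fastest)")
```

Under these assumptions GLM finishes 1,000 pages in roughly 21 minutes, while DeepSeek needs close to five hours, so the blank-page and grounding advantages have to be worth that difference for your workload.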
3.2 OlmOCR-Bench snapshot
Reported from LightOnOCR-2 benchmarking (headers/footers excluded setting):
The Elo setup is judged by Gemini 3 Flash in the authors' pipeline and is not a drop-in replacement for independent leaderboard results.
4 Model fit by use case
4.1 Use-case fit matrix
| Model | Choose first when | Why it wins there | Watch-outs |
| --- | --- | --- | --- |
| HunyuanOCR | You need dense grounded output for extraction or audit-heavy workflows | Strongest grounded workflow in the current full-50 hands-on benchmark | Slower than GLM or FireRed; raw output usually needs more normalization |
| DeepSeek-OCR-2 | You need stronger grounding than GLM or FireRed plus strict blank-page handling | Second grounded workflow in the current hands-on benchmark and the only one to detect 3/3 blank pages | Slowest current workflow; current helper job adds startup overhead |
| GLM-OCR | You need a strong default baseline across mixed documents | Top-tier reported OmniDocBench result in compact size; multiple serving paths | Very new release; long-tail behavior still needs broad replication |
| dots.ocr-1.5 | You need one model for OCR plus web/screen/scene/SVG parsing | Broad task coverage in a single 3B model family and strong reported release benchmarks | Many benchmark claims are currently model-card/repo reported for this version |
| FireRed-OCR | You need stricter structural Markdown behavior with formulas and tables | Public training story explicitly targets structural hallucination and syntactic validity | Early-cycle release; benchmark evidence is still author-reported and needs broad replication |
| DeepSeek-OCR-2 | You need markdown-oriented output and mode switching | Reading-order-focused design and dual extraction modes (Free OCR and structured conversion) | Validate complex tables and multilingual edge cases on your own corpus |
| LightOnOCR-2-1B | You process high page volume and care about cost per page | Strong reported OlmOCR-Bench + throughput profile at 1B scale | Check performance on your language/script distribution |
| GutenOCR | You need reliable text-to-location grounding for downstream extraction | Grounded OCR is core design objective and first-class output | Weight license is CC-BY-NC; commercial use may be constrained |
| HunyuanOCR | You want one compact model for broad document tasks | Strong reported compact-model results across parsing-oriented tasks | Custom community license requires legal/compliance review |
| PaddleOCR-VL-1.5 | Your inputs are messy scans/photos and you already run Paddle tooling | Near-frontier reported OmniDocBench score with robustness framing | Confirm accuracy on your distortion mix and template families |
4.2 Adoption and maturity signals (Feb 13, 2026 snapshot)
These are not quality scores. They are practical signals for implementation confidence and ecosystem support.
| Model | Maturity signal | What it means for rollout |
| --- | --- | --- |
| GLM-OCR | Rapid early GitHub/HF uptake after launch | Fast-moving ecosystem, but still early for stability assumptions |
| dots.ocr-1.5 | Fresh Feb 16, 2026 release with expanded task scope | High upside for multi-task use cases, but treat current results as early-cycle evidence |
| FireRed-OCR | March 2026 release with repo, model card, and paper all live at launch | Stronger evidence package than many brand-new challengers, but still early for stability assumptions |
| DeepSeek-OCR-2 | Strong HF traction soon after release | Good community momentum for tooling and examples |
| HunyuanOCR | High visibility and broad activity across channels | More examples in the wild for compact deployment patterns |
| GutenOCR | Growing technical interest from doc-AI builders | Strong relevance for grounding-heavy extraction workflows |
| LightOnOCR-2-1B | Attention driven by 1B speed/quality profile | Good candidate for throughput-first deployments |
| PaddleOCR-VL-1.5 | Benchmark-competitive and aligned with Paddle stack users | Lower integration risk if your team already uses Paddle |
5 A practical evaluation protocol (6 core models + challengers)
If you want one rigorous, reproducible process, run one fixed 50-page bake-off across six core models:
GutenOCR
HunyuanOCR
LightOnOCR-2-1B
DeepSeek-OCR-2
GLM-OCR
PaddleOCR-VL-1.5
Then add challenger tracks for newly released models. For this cycle:
dots.ocr-1.5 (especially if you need OCR plus web/screen/scene/SVG parsing)
FireRed-OCR (especially if malformed Markdown, broken formulas, or table closure failures are expensive in your workflow)
5.1 Preflight gates (before benchmarking)
Filter models before inference:
License/commercial gate
Region/compliance gate
Serving/runtime gate
Output-format gate
Practical note:
GutenOCR weights are CC-BY-NC, which often disqualifies direct commercial deployment.
HunyuanOCR uses a custom community license with territory and usage constraints, so legal review should happen before production rollout.
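The four gates can be recorded as a small checklist so that no model reaches inference with an unreviewed gate. The sketch below is illustrative: the `Candidate` fields and the `preflight` helper are hypothetical names, and the gate values shown only reflect the two license notes above for a commercial deployment. Everything else is left unreviewed (`None`) rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

GATES = ("license", "region", "serving", "output_format")


@dataclass(frozen=True)
class Candidate:
    """One model under review. True = passed, False = failed, None = not reviewed."""
    name: str
    license: Optional[bool] = None
    region: Optional[bool] = None
    serving: Optional[bool] = None
    output_format: Optional[bool] = None


def preflight(candidate: Candidate) -> Tuple[str, List[str]]:
    """Return a status plus the gates that failed or still need review."""
    failed = [g for g in GATES if getattr(candidate, g) is False]
    if failed:
        return "rejected", failed
    pending = [g for g in GATES if getattr(candidate, g) is None]
    if pending:
        return "needs_review", pending
    return "eligible", []


# Assumed scenario: a direct commercial deployment.
candidates = [
    Candidate("GutenOCR", license=False),   # CC-BY-NC weights
    Candidate("HunyuanOCR", license=None),  # custom license, legal review pending
]
for c in candidates:
    status, gates = preflight(c)
    print(f"{c.name}: {status} {gates}")
```

Treating "not reviewed" as a distinct state keeps a model with an unread license from silently passing into the benchmark run.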
5.2 50-page stratified set
| Slice | Pages | Why this slice matters |
| --- | --- | --- |
| Clean digital single-column PDFs | 8 | Baseline text fidelity |
| Multi-column + sidebars + footnotes | 8 | Reading-order stress |
| Table-heavy documents | 8 | Structure fidelity and cell ordering |
| Formula-heavy documents | 6 | Formula extraction and sequencing |
| Forms/invoices/receipts | 6 | Region association and key-value linking |
| Messy photos/scans | 10 | Skew, warping, glare, and capture artifacts |
| Multilingual mixed-script pages | 4 | Language/layout stability |
Total: 50 pages.
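One way to keep the set fixed across model runs is to draw it once with a seeded sampler and store the result. The sketch below assumes you already have a pool of candidate page paths grouped by slice; the slice keys, the `draw_bakeoff_set` name, and the file paths are made up for illustration.

```python
import random

# Page quotas per slice, taken from the stratified set definition.
SLICE_QUOTAS = {
    "clean_digital_single_column": 8,
    "multi_column_sidebars_footnotes": 8,
    "table_heavy": 8,
    "formula_heavy": 6,
    "forms_invoices_receipts": 6,
    "messy_photos_scans": 10,
    "multilingual_mixed_script": 4,
}
assert sum(SLICE_QUOTAS.values()) == 50


def draw_bakeoff_set(pool: dict, seed: int = 0) -> dict:
    """Pick the quota for each slice from `pool`, reproducibly for a given seed."""
    rng = random.Random(seed)
    selected = {}
    for slice_name, quota in SLICE_QUOTAS.items():
        # Sort first so the draw does not depend on the pool's input order.
        candidates = sorted(pool.get(slice_name, []))
        if len(candidates) < quota:
            raise ValueError(
                f"{slice_name}: need {quota} pages, only {len(candidates)} available"
            )
        selected[slice_name] = sorted(rng.sample(candidates, quota))
    return selected


# Hypothetical pool: 12 candidate pages per slice.
example_pool = {
    name: [f"{name}/page_{i:03d}.png" for i in range(12)] for name in SLICE_QUOTAS
}
bakeoff = draw_bakeoff_set(example_pool, seed=2026)
print(sum(len(pages) for pages in bakeoff.values()), "pages selected")
```

Raising an error when a slice is short is deliberate: silently shrinking a slice would change the mix that every model is scored on.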
5.3 Ground-truth package per page
Prepare three artifacts for each page:
gt_text.txt
gt_markdown.md
gt_blocks.json with block_id, text, bbox, reading_index, and type
Quality control:
Dual-annotate all messy-photo pages and all multi-column pages.
Resolve disagreements before scoring.
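A small validator can catch malformed `gt_blocks.json` files before scoring starts. The five field names come from the list above. Everything else in this sketch is an assumption: the `[x0, y0, x1, y1]` bbox layout, a 0-based contiguous `reading_index`, and the example `type` values are illustrative choices, not a published schema.

```python
import json

REQUIRED_FIELDS = ("block_id", "text", "bbox", "reading_index", "type")


def validate_blocks(blocks: list) -> list:
    """Return a list of human-readable problems; an empty list means the file passes."""
    problems = []
    seen_ids = set()
    for i, block in enumerate(blocks):
        missing = [f for f in REQUIRED_FIELDS if f not in block]
        if missing:
            problems.append(f"block {i}: missing fields {missing}")
            continue
        if block["block_id"] in seen_ids:
            problems.append(f"block {i}: duplicate block_id {block['block_id']!r}")
        seen_ids.add(block["block_id"])
        bbox = block["bbox"]
        # Assumed layout: [x0, y0, x1, y1] with a positive width and height.
        if not (isinstance(bbox, (list, tuple)) and len(bbox) == 4
                and bbox[0] < bbox[2] and bbox[1] < bbox[3]):
            problems.append(f"block {i}: bbox {bbox!r} is not [x0, y0, x1, y1]")
    # Assumed convention: reading_index runs 0..n-1 with no gaps or repeats.
    indices = sorted(b["reading_index"] for b in blocks if "reading_index" in b)
    if indices != list(range(len(indices))):
        problems.append("reading_index values are not a contiguous 0..n-1 sequence")
    return problems


# Hypothetical two-block page.
sample_blocks = json.loads("""
[
  {"block_id": "b0", "text": "Invoice 0042", "bbox": [40, 32, 410, 70],
   "reading_index": 0, "type": "title"},
  {"block_id": "b1", "text": "Total due: 100.00", "bbox": [40, 90, 300, 120],
   "reading_index": 1, "type": "paragraph"}
]
""")
print(validate_blocks(sample_blocks) or "gt_blocks OK")
```

Running the same check on both annotators' files before comparing them keeps schema errors from being mistaken for annotation disagreements.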
5.4 Inference protocol (same policy for all models)
Render all pages at one fixed resolution (for example, 200 DPI).
Use deterministic decoding (temperature=0, no retries in the primary run).
Freeze model versions/commits and prompts.
Run one no-heuristic primary pass; report heuristic retries separately if used.
Mode recommendations:
DeepSeek-OCR-2: run both Free OCR and markdown conversion mode.
GLM-OCR: run markdown plus JSON layout output.
PaddleOCR-VL-1.5: run full document parsing mode.
dots.ocr-1.5: run document parsing mode first; if relevant, add web parsing and scene spotting prompts, and evaluate SVG output in a separate track.
FireRed-OCR: run its standard structured Markdown mode and score syntax-validity failures separately from plain text errors.
GutenOCR: run grounded mode (bbox outputs) and plain text mode.
HunyuanOCR: run document parsing prompt and spotting-style prompt where applicable.
LightOnOCR-2-1B: run standard OCR parsing mode.
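The shared policy and the per-model modes can be frozen into one config record per model, with a fingerprint that changes whenever any setting changes. In this sketch the mode labels are shorthand for the recommendations above rather than real prompt strings or flags, and `model_revision` is a placeholder you would replace with the pinned commit or tag.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunConfig:
    """Settings that must stay identical (or explicitly pinned) across the bake-off."""
    model: str
    model_revision: str        # commit hash or tag, frozen before the run
    modes: Tuple[str, ...]     # shorthand labels, not actual prompt text
    dpi: int = 200             # one fixed render resolution for all pages
    temperature: float = 0.0   # deterministic decoding
    max_retries: int = 0       # no retries in the primary pass

    def fingerprint(self) -> str:
        """Short hash of every setting, to store next to each result file."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Shorthand mode labels for each model (assumed names).
MODES = {
    "DeepSeek-OCR-2": ("free_ocr", "markdown_conversion"),
    "GLM-OCR": ("markdown", "json_layout"),
    "PaddleOCR-VL-1.5": ("document_parsing",),
    "dots.ocr-1.5": ("document_parsing",),
    "FireRed-OCR": ("structured_markdown",),
    "GutenOCR": ("grounded_bbox", "plain_text"),
    "HunyuanOCR": ("document_parsing", "spotting"),
    "LightOnOCR-2-1B": ("standard_ocr",),
}

configs = [
    RunConfig(model=name, model_revision="<pin-before-run>", modes=modes)
    for name, modes in MODES.items()
]
for cfg in configs:
    print(cfg.model, cfg.modes, cfg.fingerprint())
```

If a heuristic retry pass is run later, giving it its own config with `max_retries` above zero produces a different fingerprint, which keeps those results separate from the primary run.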
5.5 Metrics
If this is your first pass, read this section as a checklist for formal evaluation teams. The important idea is simple: score text accuracy, layout order, table structure, and operational reliability separately so one strong number cannot hide a weak production behavior.
Reading-order metrics:
RO-ED (normalized reading-order edit distance, lower is better)
Kendall tau on reading_index sequence (higher is better)
Missing-block rate
Duplicate-block rate
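The four reading-order metrics can be computed from two sequences of block IDs: the ground-truth order and the model's predicted order. The sketch below assumes predicted blocks have already been matched to ground-truth `block_id` values (the matching step is not shown), normalizes the edit distance by the longer sequence, and uses the tau-a form of Kendall tau. Those are illustrative choices, so fix whichever definitions you pick across all runs.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def ro_ed(gt_order, pred_order) -> float:
    """Normalized reading-order edit distance in [0, 1]; lower is better."""
    denom = max(len(gt_order), len(pred_order))
    return edit_distance(gt_order, pred_order) / denom if denom else 0.0


def kendall_tau(gt_order, pred_order) -> float:
    """Tau-a over blocks present in both orders; returns 0.0 if fewer than two are shared."""
    pred_rank = {}
    for pos, block_id in enumerate(pred_order):
        pred_rank.setdefault(block_id, pos)  # first occurrence wins
    shared = [b for b in gt_order if b in pred_rank]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    # Each pair is already in ground-truth order, so it is concordant
    # when the prediction keeps the same relative order.
    concordant = sum(pred_rank[a] < pred_rank[b] for a, b in pairs)
    return (2 * concordant - len(pairs)) / len(pairs)


def missing_block_rate(gt_order, pred_order) -> float:
    """Share of ground-truth blocks that never appear in the prediction."""
    return len(set(gt_order) - set(pred_order)) / len(gt_order) if gt_order else 0.0


def duplicate_block_rate(pred_order) -> float:
    """Share of predicted entries that repeat an earlier block."""
    if not pred_order:
        return 0.0
    repeats = sum(count - 1 for count in Counter(pred_order).values())
    return repeats / len(pred_order)


gt = ["b0", "b1", "b2", "b3"]
pred = ["b0", "b2", "b1", "b1"]  # swapped two blocks, dropped b3, repeated b1
print(ro_ed(gt, pred), kendall_tau(gt, pred),
      missing_block_rate(gt, pred), duplicate_block_rate(pred))
```

Reporting the missing and duplicate rates separately matters because Kendall tau here only looks at blocks found in both sequences, so a model can keep a high tau while dropping content.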
Content and structure metrics:
CER and WER
Table TEDS
Formula metric (CDM or token-level F1, fixed across all runs)
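CER and WER are both edit distances divided by the reference length, at character and word level respectively. The minimal sketch below leaves out text normalization (case, whitespace, Unicode forms), which you would need to fix once and apply to every model. Its handling of an empty reference is an assumed convention, relevant for blank pages, not a standard. Table TEDS and formula metrics need tree or token comparison and are not covered here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (characters or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def _error_rate(ref_units, hyp_units) -> float:
    if not ref_units:
        # Assumed convention for blank pages: any output on an empty
        # reference counts as a full error, and empty output is perfect.
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; can exceed 1.0 when the output is much longer."""
    return _error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate using whitespace tokenization."""
    return _error_rate(reference.split(), hypothesis.split())


# A digit misread as a letter: small CER, but the numeric field is wrong.
print(cer("total 100", "total 1O0"), wer("total 100", "total 1O0"))
```

The example shows why text metrics alone are not enough for extraction work: one wrong character out of nine is a low CER, yet the only number on the line is now unusable.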
Benchmark overfitting risk: do not promote a model to primary production without document-type stratified tests.
Layout drift risk: table structure quality can degrade faster than plain text quality across new templates.
Grounding risk: extraction pipelines fail when text is correct but linked to the wrong box or wrong row.
License risk: confirm commercial terms for each model/repo combination, not just the model card headline.
Operations risk: define fallback modes (text-only, markdown, or dual-model checks) before first rollout.
7 Conclusion
By February 2026, the market is no longer about finding one giant model to do everything. It is about matching the model to the failure mode you can least afford.
A practical rollout is:
Start with the use-case matrix in Section 4.1.
Shortlist three models with different strengths.
Run the fixed 50-page protocol.
Promote one primary model and one fallback model, then keep one fast-moving release lane for models like dots.ocr-1.5 or FireRed-OCR.
That is usually safer than picking one leaderboard winner and hoping the same order will hold on your own document mix.