This page is for the method, the evidence, and the limits.
Trust basis: this was run self-hosted on a single RTX 3090 Ti 24 GB box, the raw outputs were kept, the harness was versioned as it changed, and every page in the corpus was reviewed at contact-sheet scale before the disputed pages were checked again at higher zoom.
The next section gives the gist in under a minute. The later sections move into the method, evidence, and appendices.
If you only have 1 minute
What we tested:
31 scanned chemistry PDFs
1331 pages
GLM-OCR, dots.ocr-1.5, MonkeyOCR, PaddleOCR PP-StructureV3, and FireRed-OCR
one self-hosted workflow on a single 3090 Ti
What changed the result:
the early runs made GLM-OCR look like the clear default
that changed only after the FireRed-OCR wrapper was patched to handle near-blank pages and preserve page images
once that happened, the benchmark stopped being a “one winner” story and became a routing story
The routing rule that survived the audit:
FireRed-OCR for text-first pages
GLM-OCR for visual-answer-dependent pages
dots.ocr-1.5 when the requirement is OCR plus broader visual parsing
What not to overread:
this is not a universal OCR leaderboard
this is not a semantic-accuracy proof across all document domains
this is a scan-heavy, cleanup-oriented benchmark with a visual audit layer
If you have 5 minutes
The pilot started with a practical question:
On real scanned PDFs, which OCR stack gives the cleanest output for the least downstream cleanup cost, and when should a page be routed to a different model?
The short answer, per model:

| Model | Strongest at | Main caveat |
|---|---|---|
| FireRed-OCR | Cleanest Markdown on text-first pages | Can still lose question-local visuals if a page really depends on them |
| GLM-OCR | Diagram-dependent question pages | Materially noisier plain-text Markdown on many long note pages |
| dots.ocr-1.5 | OCR plus web, screen, scene, or SVG-style parsing | Not the safest default for scan-heavy school PDFs |
| PaddleOCR PP-StructureV3 | Modular parsing and competitive fallback on messy worksheets | Less clean overall in this Markdown-first comparison |
| MonkeyOCR | Isolated wins on some pages/documents | Too unstable overall to become the default |
What to do with that conclusion
If you are choosing a production OCR workflow, treat this as a routing problem:
one default for text-first pages
one safer model for visual-answer-dependent pages
one optional broader parser if your scope extends beyond document OCR
That is a stronger operational lesson than chasing a single “best” OCR model.
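The routing rule above can be sketched as a small dispatch function. This is an illustrative sketch, not the pilot's actual code; the page-type labels and the `needs_broader_parsing` flag are assumptions about how pages get tagged upstream.

```python
# Illustrative sketch of the routing rule from this pilot.
# The page-type labels are assumptions; how `page_type` is produced
# (a classifier or manual tags) is a separate upstream decision.

TEXT_FIRST = {"notes", "answer_key", "worked_answers", "bullet_revision"}
VISUAL_DEPENDENT = {"diagram_question", "apparatus", "graph_selection",
                    "particle_box_mcq", "reaction_scheme"}

def route_page(page_type: str, needs_broader_parsing: bool = False) -> str:
    """Return the OCR model to use for one page."""
    if needs_broader_parsing:          # web/screen/scene/SVG-style parsing
        return "dots.ocr-1.5"
    if page_type in VISUAL_DEPENDENT:  # keep question-local visuals inline
        return "GLM-OCR"
    if page_type in TEXT_FIRST:        # cleanest Markdown, least cleanup
        return "FireRed-OCR"
    return "GLM-OCR"                   # safer fallback for unknown pages
```

Routing per page, rather than per document, is what the hybrid documents later in this write-up required.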
Full methodology
1 What this benchmark was trying to answer
The benchmark question was intentionally narrower than the usual public OCR comparison.
It was not:
“Which model has the highest OmniDocBench score?”
“Which model is best in general?”
“Which model wins on digitally native PDFs?”
It was:
what happens on scan-heavy notes and worksheets,
how often the raw output needs cleanup,
and which failure modes are expensive enough to change the deployment plan.
2 Corpus design
The internal pilot used a narrow but operationally useful slice:
31 PDFs
1331 pages
upper-secondary chemistry notes and worksheets
heavy use of scanned pages, formulas, tables, apparatus diagrams, particle diagrams, answer keys, and mixed question layouts
Three DOCX files in the same source tree were excluded from the OCR bake-off so the comparison stayed PDF-to-PDF and page-for-page.
Why this slice was useful:
it is difficult enough to expose OCR failure modes quickly
it mixes text-heavy pages with diagram-dependent question pages
it is closer to real classroom scan cleanup than to polished benchmark corpora
Why this slice is still limited:
it is only one domain
it over-represents scan-heavy school documents
it should not be treated as a universal proxy for invoices, legal bundles, or clean enterprise PDFs
2.1 A benchmark corpus is not enough without page archetypes
A lot of OCR comparison posts show a few memorable pages without saying which page types are supposed to stress the models. That makes for vivid examples, but it weakens the benchmark.
For future runs, the corpus should always be paired with a smaller archetype suite. The archetype suite is not meant to replace the full corpus. It is meant to make the test explainable.
At minimum, the page archetypes should include:
text-first notes pages
diagram-question pages
table-heavy pages
formula-heavy pages
worksheet answer-option pages
blank or near-blank scans
noisy or skewed scans if those are part of the deployment brief
Different OCR stacks fail in different ways:
some lose local diagrams
some flatten tables
some break formulas
some hallucinate on empty pages
some stay readable on prose but fail once the page becomes option- or figure-dependent
The evaluation should also be reported by slice, not only as one total. That is where slice-level macro-averaging becomes useful.
The cleaner approach is:
score each archetype slice separately
compute a macro-average across slices
still keep the whole-corpus total for operational realism
That prevents a benchmark from looking strong only because it saw many easy text-first pages while quietly failing on diagram-dependent or formula-heavy pages.
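The slice-level scoring described above can be sketched in a few lines. This is a generic illustration, not the pilot's scorer; it assumes per-page scores grouped by archetype, with lower scores better, matching the artifact-score convention used in this write-up.

```python
# Sketch of slice-level macro-averaging: score each archetype slice
# separately, then average the per-slice means so many easy text-first
# pages cannot mask failures on diagram- or formula-heavy slices.
# `scores` maps archetype name -> list of per-page scores (lower = better).

def macro_average(scores: dict[str, list[float]]) -> float:
    slice_means = [sum(v) / len(v) for v in scores.values() if v]
    return sum(slice_means) / len(slice_means)

def micro_average(scores: dict[str, list[float]]) -> float:
    all_pages = [s for v in scores.values() for s in v]
    return sum(all_pages) / len(all_pages)
```

With nine easy text-first pages scoring 1.0 and one diagram page scoring 9.0, the whole-corpus micro-average still looks strong (1.8) while the macro-average (5.0) exposes the weak slice. Reporting both is the point.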
2.2 Execution environment
The pilot was run self-hosted on a single NVIDIA GeForce RTX 3090 Ti with 24 GB VRAM.
That was useful for two reasons:
these were not vendor-hosted black-box OCR calls
the serving/runtime constraints were close to what a small production team can actually reproduce on one strong workstation
The practical output target was Markdown-first comparison. Where a model exposed more than plain text, those extra artifacts were retained when useful:
GLM-OCR: Markdown plus region-aware outputs
FireRed-OCR: Markdown plus patched page-image preservation in the final run
PaddleOCR PP-StructureV3: full document parsing output normalized into the comparison flow
dots.ocr-1.5 and MonkeyOCR: raw Markdown-oriented outputs
2.3 Benchmark contract: what has to stay fixed
Models are often shown side by side without keeping the evaluation conditions steady enough for the comparison to mean much.
This pilot did not start with one frozen harness. The final methodology is stricter than the first exploratory runs, and that should be stated directly.
For future runs, the minimum contract should be:
same page set
same page order
same render path for rasterized pages
same target output format for comparison
same hardware disclosure
per-model inference settings documented alongside the output
At a minimum, every run should disclose:
rendering method and resolution
OCR prompt or extraction instruction if one exists
decoding or generation limits if the model is generative
output target (Markdown, HTML, JSON, or normalized proxy)
runtime environment
any post-processing that happens before scoring
Without that contract, the comparison becomes too easy to distort. A model can look stronger or weaker because of truncation limits, low render resolution, missing blank-page handling, or a quiet change in post-processing.
That is exactly why the patched 5-way comparison is more trustworthy than the earlier exploratory passes.
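The minimum contract above is easy to make concrete as a frozen run manifest saved next to the outputs. The field names below are illustrative, not the pilot's actual schema.

```python
# A minimal "benchmark contract" record. Freezing these fields per run is
# what makes later rankings comparable; the schema here is illustrative.
import json
from dataclasses import dataclass, asdict, field

@dataclass(frozen=True)
class RunManifest:
    corpus_version: str
    page_order: str            # e.g. a hash of the ordered page list
    render: str                # render path and resolution for rasterized pages
    output_target: str         # "markdown", "html", "json", ...
    hardware: str
    model_settings: dict = field(default_factory=dict)  # per-model inference settings
    post_processing: tuple = ()  # ordered steps applied before scoring

def save_manifest(m: RunManifest, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(m), f, indent=2, sort_keys=True)
```

Anything that cannot be written into a record like this is a condition that can silently drift between runs.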
2.4 Benchmark changelog and run versioning
OCR write-ups often present every comparison as though it came from one perfectly frozen harness. That was not true in this pilot, and it should not be hidden.
For each refresh, the workflow should log:
corpus version
archetype-suite version
model versions
inference settings
post-processing steps
scoring logic version
any wrapper fixes that change the outcome materially
That is not bureaucracy. It is the only clean way to explain why a later run changed the ranking.
In this pilot, the practical example was straightforward:
the early 3-way and 4-way runs were exploratory
the later 5-way run changed because the FireRed-OCR wrapper was patched
the ranking shift was therefore a pipeline change, not a mysterious benchmark reversal
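A minimal way to make such a changelog actionable is to diff the records of two runs, so a ranking change can be traced to a named pipeline change. The run records below are illustrative stand-ins for this pilot's 4-way and patched 5-way runs.

```python
# Sketch of run versioning: keep one record per refresh and diff the
# fields between two runs, so a ranking shift is traceable to a pipeline
# change instead of looking like a mysterious benchmark reversal.

def diff_runs(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every changed field."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

# Illustrative records, not the pilot's actual logs.
run_4way = {"models": 4, "firered_wrapper": None, "scoring": "v1"}
run_5way = {"models": 5,
            "firered_wrapper": "blank-page gate + page-image preservation",
            "scoring": "v1"}

changed = diff_runs(run_4way, run_5way)
# `changed` names exactly why the ranking moved: the model count and the
# FireRed-OCR wrapper fix, while the scoring logic stayed at v1.
```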
3 Model set and run sequence
The exploration did not start with five models at once. It expanded over time as the failure modes became clearer.
3.1 Public benchmark families we track
Before running any internal bake-off, it helps to separate public benchmark families by what they are actually measuring.
The three families worth tracking most closely are:
OmniDocBench for broad document parsing quality across complex page structures
OlmOCR-Bench for narrower OCR-style checks and unit-test-like failure cases
CC-OCR for multilingual and broader cross-category OCR coverage
Those benchmark families are useful, but they should not be flattened into one fake universal leaderboard.
In practice:
OmniDocBench is useful for broad document understanding directionally
OlmOCR-Bench is useful for precise failure-style comparisons
CC-OCR is useful when multilingual or cross-domain OCR behavior matters
What they do not do is replace a fixed in-house bake-off on the page types you actually deploy against.
3.2 Run history
The pilot expanded in three stages.
Raw 3-way comparison
Models:
GLM-OCR
dots.ocr-1.5
MonkeyOCR
Headline result:
wins: GLM=24, dots=2, Monkey=5
Model totals inside that run:

| Model | Artifact score total (lower = better) | Notes |
|---|---|---|
| GLM-OCR | 2117 | Clear early default on this corpus |
| dots.ocr-1.5 | 5750 | Hurt by hallucinations and watermark-like residue in that raw pass |
| MonkeyOCR | 4696 | Some strong page/document wins, but unstable overall |
What that clarified:
GLM-OCR was the safest early default on this scan-heavy corpus
dots.ocr-1.5 was interesting, but not the safest baseline for pure scanned-document OCR
MonkeyOCR produced some strong document wins, but it was not stable enough to become the default
Raw 4-way comparison
Models:
GLM-OCR
dots.ocr-1.5
MonkeyOCR
PaddleOCR PP-StructureV3
Headline result:
wins: GLM=14, dots=4, Monkey=3, Paddle=10
Model totals inside that run:

| Model | Artifact score total (lower = better) | Notes |
|---|---|---|
| GLM-OCR | 4695 | Strongest overall baseline in that 4-way pass |
| dots.ocr-1.5 | 6098 | Still a weak default for this scan-heavy OCR slice |
| MonkeyOCR | 6755 | Too noisy overall despite some wins |
| PaddleOCR PP-StructureV3 | 5827 | Much more competitive than a paper-only read would suggest |
What that clarified:
PaddleOCR was more competitive than a public-paper-only reading would suggest, especially on some messy worksheet-style documents
GLM-OCR still held the strongest overall baseline position
dots.ocr-1.5 remained more compelling as a broader visual parsing candidate than as the default scanned-document OCR engine
Patched 5-way comparison
Models:
GLM-OCR
dots.ocr-1.5
MonkeyOCR
PaddleOCR PP-StructureV3
FireRed-OCR
Before this run, the FireRed-OCR wrapper was patched in two specific ways:
a conservative blank-page gate to stop hallucinations on near-empty pages
page-image preservation so figure-heavy pages were not silently flattened into text-only Markdown
Model totals inside that run:

| Model | Artifact score total (lower = better) | Notes |
|---|---|---|
| FireRed-OCR | lowest in this run | 426 page-image refs preserved |
| GLM-OCR | 4655 | Strong on diagram-linked pages, but materially noisier on many text-first pages |
| PaddleOCR PP-StructureV3 | 5787 | Competitive on some messy pages, but less clean overall in this run |
| dots.ocr-1.5 | 6053 | Better read as a broader visual parser than a pure OCR default |
| MonkeyOCR | 6715 | Some document wins, but weakest total in the final run |
Important caveat:
the 3-way, 4-way, and patched 5-way runs are all useful
the patched 5-way result should be read as the final routing-oriented comparison because it reflects the fixed FireRed-OCR evaluation path
Another caveat matters for trust: do not read the absolute totals from the early raw runs and the patched 5-way run as one single frozen leaderboard. The harness evolved during the pilot. The useful signals are:
within-run rankings
named failure modes
whether the ranking changed for a defensible pipeline reason
4 Comparison dimensions we score before choosing a model
Public OCR roundups often collapse everything into one question: “which model is best?”
That is not how OCR decisions should be made.
Before promoting any stack, the comparison should at least score or describe these dimensions:
| Dimension | Why it matters | What to check in practice |
|---|---|---|
| Text fidelity | Raw text still has to be right before anything else matters | Prose accuracy, plus formula and chemistry-token damage on real scans |
| Serving practicality | A good model that is awkward to serve may still be the wrong choice | Single-GPU fit, batch throughput, wrapper complexity, memory use |
| Domain fit | Benchmarks underrepresent many real document types | Notes, worksheets, forms, receipts, bank statements, multilingual scans |
This is one reason the final routing rule ended up more useful than a single winner. Different models were stronger on different dimensions, and the deployment choice became a routing problem rather than a one-model problem.
5 Scoring and visual audit
The comparison needed something faster than full line-by-line human adjudication across 1331 pages, but more useful than word count alone.
5.1 Heuristic artifact scoring
The first pass was the artifact-score heuristic behind the run tables: each model's raw Markdown is scored for cleanup-triggering residue, and a lower total is better.
Why it is useful:
it gives a stable first pass for comparing raw Markdown output at scale
Why it is not enough on its own:
it is not a semantic-accuracy percentage
it does not fully capture lost diagrams
it can over-penalize or under-penalize edge cases if used without manual review
So the benchmark did not stop at the heuristic.
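The pilot's exact heuristic is not reproduced on this page, so the sketch below is only an illustration of the kind of cleanup-oriented artifact counter described: scan raw Markdown for residue that would trigger manual cleanup and return a total where lower is better. The specific patterns are assumptions, not the pilot's rules.

```python
import re

# Illustrative artifact counter, in the spirit of the scoring described
# in this write-up (not the pilot's actual heuristic). It counts
# cleanup-triggering residue in raw Markdown; lower totals are better.

ARTIFACT_PATTERNS = [
    r"\ufffd",                        # replacement chars from bad decoding
    r"\\mathrm\{(?:[A-Za-z] ){2,}",   # spaced-out chemistry tokens: "M g C l"
    r"(?m)^(.+)\n\1$",                # immediately repeated lines
    r"\|{3,}",                        # table-rule debris
]

def artifact_score(markdown: str) -> int:
    return sum(len(re.findall(p, markdown)) for p in ARTIFACT_PATTERNS)
```

The value of a counter like this is stability across 1331 pages, not semantic accuracy, which is exactly why the visual audit layer still has to follow it.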
5.2 Manual visual audit layer
The heuristic pass was followed by visual review.
Audit method:
full 31-PDF corpus reviewed at contact-sheet scale across all 1331 pages
deeper zoomed page-level passes on the two remaining GLM-OCR wins
deeper zoomed passes on three ambiguous mixed documents that could not be classified confidently from the contact sheets alone
Every page in the corpus was reviewed visually at contact-sheet scale, and the important disagreements were checked again at higher zoom before any routing conclusion was kept.
The visual audit should publish a compact review packet for the pages that changed the routing policy:
one source-page render
one raw output excerpt from each model being discussed
one sentence on the decisive failure mode
one routing conclusion tied to that page type
6 Failure modes that actually changed the routing
The benchmark became useful only once the failure modes were named clearly.
6.1 Blank-page hallucination
This mattered most for FireRed-OCR early on.
Near-empty scanned pages triggered hallucinated content until the workflow added a conservative blank-page gate. Without that fix, FireRed looked worse than it really was.
Lesson:
evaluation pipelines can distort model quality if they do not treat blank or near-blank scans explicitly
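A conservative blank-page gate of the kind described can be sketched as a simple ink-ratio check. The thresholds and the grayscale-pixel-list input are assumptions for illustration; the pilot's actual gate may differ.

```python
# Sketch of a conservative blank-page gate, assuming grayscale pixel
# values 0-255 for one rendered page. If almost no pixels look like ink,
# skip OCR entirely instead of letting a generative model hallucinate
# content onto the near-empty scan. Thresholds here are illustrative.

def is_near_blank(pixels: list[int], white_floor: int = 235,
                  ink_budget: float = 0.002) -> bool:
    """True when fewer than `ink_budget` of the pixels look like ink."""
    ink = sum(1 for p in pixels if p < white_floor)
    return ink / max(len(pixels), 1) < ink_budget

def ocr_page(pixels: list[int], run_model) -> str:
    if is_near_blank(pixels):
        return ""            # conservative: emit nothing, never hallucinate
    return run_model(pixels)
```

The gate errs on the side of emitting nothing, which is the right trade-off here: a missed faint page is cheap to catch in the visual audit, while a hallucinated page silently poisons the output.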
6.2 Inline diagram loss
This is where GLM-OCR stayed valuable.
When a page contains:
a reaction scheme
an apparatus sketch
particle-box answer options
a graph that the question directly references
the OCR problem is not just “read the text.” It is “keep the local visual attached to the question.”
GLM-OCR remained safer on those pages because its inline region preservation was more useful than cleaner plain-text output.
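The "426 page-image refs preserved" figure in the patched run suggests a simple audit check worth automating: count the image references each model keeps in its Markdown. The standard `![alt](path)` link format is an assumption here; the pilot's preserved-image format may differ.

```python
import re

# Sketch of a "did the model keep the page's visuals?" check: count
# Markdown image references per output. Assumes standard ![alt](path)
# links, which is an assumption about the preserved-image format.

IMG_REF = re.compile(r"!\[[^\]]*\]\([^)]+\)")

def count_image_refs(markdown: str) -> int:
    return len(IMG_REF.findall(markdown))
```

A sharp drop in this count between models on the same figure-heavy page is exactly the silent-flattening failure mode described above.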
6.3 Text-cleanliness and cleanup cost
Once blank-page handling was fixed, FireRed-OCR often produced cleaner Markdown on:
explanatory notes
answer pages
formula-heavy sections
bullet-heavy revision pages
long text-first classroom notes
Cleanup time is often the hidden cost center in OCR projects.
6.4 Broader parser vs better OCR default
dots.ocr-1.5 stayed interesting, but for a different reason.
It is more valuable when the requirement includes:
web or screen parsing
scene text
SVG-style and broader visual parsing
That is a genuine product difference. It is not the same decision as picking the safest OCR default for scanned PDFs.
6.5 The early harness and the final harness were not identical
This is worth stating plainly because credibility depends on it.
The raw 3-way and 4-way runs were exploratory passes. The patched 5-way run came later, after the pilot fixed specific evaluation weaknesses, especially around FireRed-OCR.
That means:
the early runs were still useful
the final run is the most trustworthy routing-oriented comparison
the real story is not “one model flipped the leaderboard overnight”
The real story is that the benchmark got better once the pipeline handled blank pages and figure preservation more honestly.
7 Concrete documents that changed the routing policy
The benchmark would have been much less useful if it only ended with aggregate win counts.
These documents are the ones that actually changed the policy.
7.1 GLM-only case
Topic 1B Identification of Ions and Gases stayed a GLM-OCR document.
Why:
almost every page depended on small inline reaction schemes or figure-linked question content
page-level image links were not enough because the diagram had to stay tied to a specific question
This was the clearest case where cleaner text alone was not the right success metric.
7.2 Hybrid cases
These documents were mixed enough that whole-document routing lost quality:
Elements, Compounds and mixtures notes
Kinetic Particle Notes
Elements, Compounds and Mixtures Worksheet
Speed of Reaction Worksheet
The shared pattern was consistent:
question pages with graph, apparatus, or particle-box answer options leaned GLM-OCR
worked answers, explanation pages, and text-heavy notes leaned FireRed-OCR
7.3 FireRed-heavy wins
The documents that made the FireRed-OCR case sharper were the text-heavy note packs and answer-heavy pages, including examples such as:
Acids and Bases Notes
Ionic Bonding Notes
Energy Changes
Electrolysis
The Atmosphere and Environment
Those files made it clear that once blank-page hallucinations were controlled, FireRed often reduced the cleanup burden on long note pages materially.
8 The routing rule that survived the audit
The most useful result from the benchmark was a routing rule that generalizes better than “best model.”
Use:
FireRed-OCR for text-first pages
GLM-OCR for visual-answer-dependent pages
dots.ocr-1.5 when the real requirement extends beyond document OCR into broader visual parsing
In practice, the page types that leaned GLM-OCR were:
“the diagram below” questions
apparatus questions
graph-selection questions
particle-box answer choices
local reaction schemes
MCQ pages where the options themselves are images
The page types that leaned FireRed-OCR were:
explanatory notes
bullet lists
worked answers
answer keys
tables
formula-heavy pages
long prose-heavy revision notes
The mixed-document outcome matters too:
some PDFs should not be routed whole-doc to one model
a single file can alternate between text-heavy notes and figure-dependent worksheet pages
What this benchmark does not prove
This benchmark does not prove that FireRed-OCR is the best OCR model in general.
It also does not prove that GLM-OCR is weak, or that dots.ocr-1.5 is a poor release.
It proves something narrower and more operationally useful:
on one hard scan-heavy corpus
with this scoring method
and with manual visual audit layered on top
the routing rule mattered more than the public leaderboard story
That is a better deployment lesson than chasing one winner.
Appendix A: Concrete page evidence from the pilot
Aggregate totals were useful, but the routing rule only became credible once the decisive page types were visible.
The examples below are taken from the actual scan-heavy pilot. They are not polished post-processed outputs. The point is to show what the models did before editorial cleanup.
Example 1: FireRed on a text-first notes page
This page is mostly linear text with a simple formula mention. That is where FireRed-OCR tended to reduce cleanup burden.
FireRed excerpt:
(b) describe the formation of ionic bonds between metals and non-metals, e.g. NaCl; MgCl₂
GLM excerpt:
(b) describe the formation of ionic bonds between metals and non-metals, e.g. NaCl; $ \mathrm{M g C l\_{2}} $
That is a small example, but it matters. On long note pages, repeated chemistry-token damage adds up quickly.
Example 2: GLM on an inline reaction-scheme question
This is the page type where GLM-OCR stayed safer. The question is not only about text. It depends on the local reaction scheme.
GLM excerpt:
1. The diagram shows a reaction scheme for compound X.

What is compound X? (N2008/P1/Q2)
FireRed excerpt:
1. The diagram shows a reaction scheme for compound X.
What is compound X?
A aluminium sulphate
B calcium carbonate
C copper(II) carbonate
D zinc carbonate
The FireRed text is readable, but the question-local diagram is no longer preserved inline. So this document stayed GLM-OCR.
Example 3: Why mixed documents need hybrid routing, part 1
This page comes from one of the documents that ended up in the hybrid bucket rather than the GLM-only bucket.
On text-and-table pages like this one, FireRed was already good enough to be the better operational choice:
### 1. Elements
☐ An element is a pure substance that cannot be broken down into simpler substances by any chemical methods.
## Chemical Symbols of Elements
<table><tr><td>Element</td><td>Symbol</td></tr><tr><td>Hydrogen</td><td>H</td></tr><tr><td>Oxygen</td><td>O</td></tr>...
That is also why the document did not stay on GLM-OCR wholesale. Some pages were already cleaner and easier to work with in FireRed.
Example 4: Why mixed documents need hybrid routing, part 2
The same document later switches into a diagram-linked worksheet page. This is where GLM-OCR was safer.
GLM excerpt:
3. The diagram shows hydrogen gas being burnt.

( a ) Name two elements that are involved in the reaction. [2]
...
4. Chemical substances may consist of three types of particles — atoms, ions or molecules.
...

FireRed excerpt:
3. The diagram shows hydrogen gas being burnt.
(a) Name two elements that are involved in the reaction. [2]
(b) (i) Name the colourless liquid. [1]
...
4. Chemical substances may consist of three types of particles — atoms, ions or molecules.
(a) What do you understand by the following terms?
(i) atom
(ii) ion
This is the same PDF as Example 3. One page type leans FireRed-OCR; the other leans GLM-OCR. Hybrid routing was not an edge case here. It was a real deployment requirement.
Example 5: Why benchmark versioning matters
This page is the clearest example of why run versioning belongs in the methodology itself.
Earlier FireRed run:
<!-- page 58 -->
The 7 Habits of Highly Effective People
That title has nothing to do with the chemistry corpus; it was hallucinated onto a near-blank scan. After the wrapper gained the conservative blank-page gate, the same page stopped producing invented text. The model did not become magically different overnight. The wrapper changed. That is exactly why benchmark changelogs and run-versioning have to be explicit.
Appendix B: Full workflow to reuse the method
If an OCR team wants a practical first benchmark instead of a benchmark-shaped press release, a sensible process is:
Build a corpus that mixes:
text-first pages
diagram-dependent pages
tables
formulas
answer keys
blank or near-blank scans
Define a smaller archetype suite so the benchmark has named page types, not only a blob of documents.
Lock the benchmark contract:
same pages
same render path
same output target
per-model settings disclosed
Run the same pages through every candidate stack.
Score cleanup-oriented artifacts, not just word count.
Publish raw page outputs for the pages that actually changed the decision.
Add visual review before promoting any model.
Treat routing as part of the product, not as a post-hoc patch.
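The steps above collapse into one small driver loop: fixed pages in a fixed order, every candidate model on every page, raw outputs kept for audit, and a cleanup-oriented score accumulated per model. The function below is a runnable sketch with that shape, not the pilot's harness; real runs would swap in actual OCR wrappers and rendered page images.

```python
# Compact driver for the workflow above. `models` maps model name to a
# callable that OCRs one page; `score` is a cleanup-oriented scorer
# (lower = better). Raw outputs are kept per model for later visual audit.

def run_benchmark(pages: list, models: dict, score) -> dict:
    results = {name: {"outputs": [], "total": 0} for name in models}
    for page in pages:                             # same page set, same order
        for name, ocr in models.items():
            out = ocr(page)
            results[name]["outputs"].append(out)   # keep the raw output
            results[name]["total"] += score(out)   # accumulate per model
    return results
```

Keeping the raw outputs inside the results, rather than only the totals, is what makes the later page-level audit and the published review packets possible.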