60-second takeaway
No single model wins all document types. Qianfan has the lowest aggregate CER (12.8%), but the real insight is per-archetype: GLM dominates diagrams (6.1% CER), while Qianfan sweeps text, tables, formulas, worksheets, and low-contrast scans (Hunyuan is the runner-up on degraded scans at 6.6%). The right answer is a routing rule, not a single model.
Where this fits
For founders: if you are building an OCR pipeline, do not pick one model — route by document type. This page gives you the routing table with the data behind it, so you can skip the bakeoff and ship.
For engineers: the per-archetype CER tables below are the decision input for a page-level router. Use them to set your routing thresholds, pick fallback models, and avoid the models that hallucinate on your document mix.
Why scanned PDFs are hard
Born-digital PDFs have extractable text layers. You can copy-paste from them, search them, and feed them straight into downstream pipelines.
Scanned PDFs are images. Every page is a raster — OCR must reconstruct text from pixels. That reconstruction fails in predictable ways:
Degraded scans — faded ink, uneven lighting, skewed pages, coffee stains. The model sees noise where you see text.
Complex layouts — multi-column pages, nested tables, sidebars, footnotes. The reading-order problem is as hard as the character-recognition problem.
Formulas — mathematical notation requires spatial reasoning that most OCR models were not trained for. A subscript in the wrong place changes the meaning entirely.
Diagrams with embedded text — flowcharts, circuit diagrams, annotated figures. The model must separate diagram elements from readable text.
Handwritten content — annotations, margin notes, filled-in worksheets. Most models trained on printed text struggle here.
The failure mode is not "OCR returns nothing." The failure mode is OCR returns plausible-looking text that is wrong — and you do not notice until a downstream consumer breaks.
Test methodology
We used the lightonocr-slice-v1 corpus: 50 pages drawn from real scanned PDFs, classified into 7 archetypes by visual structure.
Corpus breakdown
| Archetype | Pages | What it tests |
| --- | --- | --- |
| text_first_notes | 10 | Clean printed text and notes |
| diagram_question | 10 | Inline diagrams, flowcharts, annotated figures |
| formula_heavy | 8 | Mathematical notation and formulas |
| table_heavy | 8 | Multi-column tables and cell boundaries |
| worksheet_options | 8 | Multiple-choice and worksheet layouts |
| low_contrast_or_faint_scan | 3 | Degraded, faded, or low-contrast scans |
| blank_or_near_blank | 2 | Pages with little or no content (false positive test) |
Models tested
Five open OCR models, each run on every page:
Qianfan (Baidu)
GLM (Zhipu AI)
Hunyuan (Tencent)
FireRed (FireRed AI)
DeepSeek (DeepSeek)
Evaluation method
We computed CER (Character Error Rate) and WER (Word Error Rate) using cross-model consensus as the reference. Where no human ground truth exists, the highest-consensus model output serves as the reference string. This is not a perfect proxy — but it is a practical one that scales to hundreds of pages without manual transcription.
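As a concrete reference, here is a minimal sketch of how CER and WER can be computed from a plain Levenshtein edit distance. It is illustrative, not the exact scoring code used in this benchmark:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not reference:
        return 0.0 if not hypothesis else float("inf")
    return edit_distance(hypothesis, reference) / len(reference)

def wer(hypothesis: str, reference: str) -> float:
    """Word Error Rate: the same computation at the token level."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # Map each distinct word to a single character so edit_distance applies.
    vocab = {w: chr(i) for i, w in enumerate(set(hyp + ref))}
    return edit_distance("".join(vocab[w] for w in hyp),
                         "".join(vocab[w] for w in ref)) / len(ref)
```

Note that CER is not capped at 100%: when the hypothesis contains more erroneous characters than the reference has, the ratio exceeds 1.0, which is exactly how the 108.6% and 158.8% hallucination cells in the results tables arise.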
For interactive side-by-side comparison and per-page voting, see the internal /ocr-review tool.
Results: aggregate comparison
| Model | CER (%) | WER (%) |
| --- | --- | --- |
| Qianfan | 12.80 | 13.18 |
| GLM | 33.84 | 27.59 |
| Hunyuan | 35.50 | 29.30 |
| FireRed | 39.01 | 23.88 |
| DeepSeek | 39.34 | 33.39 |
Qianfan leads by a wide margin on aggregate CER. But aggregate scores hide archetype-specific performance. GLM, for example, is 3.6x better than Qianfan on diagram pages — a fact invisible in the aggregate table.
The per-archetype breakdown below is where the routing decisions come from.
Results by document type
Per-archetype CER (%) by model
| Archetype | Pages | FireRed | GLM | Hunyuan | DeepSeek | Qianfan |
| --- | --- | --- | --- | --- | --- | --- |
| text_first_notes | 10 | 10.0 | 20.7 | 8.2 | 8.3 | **5.9** |
| diagram_question | 10 | 39.9 | **6.1** | 65.9 | 30.2 | 22.0 |
| formula_heavy | 8 | 78.7 | 108.6 | 42.5 | 76.6 | **20.7** |
| table_heavy | 8 | 39.7 | 35.6 | 63.6 | 43.8 | **15.7** |
| worksheet_options | 8 | 12.2 | 15.7 | 16.2 | 46.5 | **7.1** |
| low_contrast_or_faint_scan | 3 | 16.3 | 14.2 | 6.6 | 69.1 | **0.0** |
| blank_or_near_blank | 2 | 158.8 | N/A | **0.0** | **0.0** | **0.0** |
Bold marks the best model per archetype.
Text-first notes
All models perform reasonably on clean printed text. Qianfan is best at 5.9% CER. Hunyuan (8.2%) and DeepSeek (8.3%) are close behind. GLM lags at 20.7% — acceptable for many use cases, but not best-in-class for this archetype.
Takeaway: for straightforward text pages, any model works. Qianfan and Hunyuan are the safest picks.
Diagram questions
GLM dominates at 6.1% CER — 3.6x better than Qianfan (22.0%) and over 10x better than Hunyuan (65.9%). Hunyuan struggles badly with inline diagrams, likely confusing diagram elements with text.
Takeaway: use GLM for any page with inline diagrams, flowcharts, or annotated figures.
Formula-heavy
Qianfan wins at 20.7% CER. Hunyuan is a distant second at 42.5%. GLM is the worst at 108.6% — a CER above 100% means the model hallucinated more characters than exist in the reference. GLM actively fabricates content when it encounters formulas.
Takeaway: use Qianfan for mathematical content. Avoid GLM entirely on formula pages.
Table-heavy
Qianfan again leads at 15.7% CER. GLM (35.6%) and FireRed (39.7%) are in the middle. Hunyuan is worst at 63.6% — it struggles with multi-column alignment and cell boundaries.
Takeaway: Qianfan for tables. GLM is an acceptable runner-up.
Worksheet/options
Qianfan best at 7.1%. FireRed (12.2%) and GLM (15.7%) are reasonable. DeepSeek is worst at 46.5% — it misreads option labels and numbering.
Takeaway: Qianfan or FireRed for multiple-choice and worksheet layouts.
Low-contrast / faint scans
Qianfan achieves 0.0% CER on the low-contrast pages in this corpus. Hunyuan is good at 6.6%. DeepSeek is terrible at 69.1% — it fails to extract legible text from degraded scans.
Takeaway: Qianfan handles degraded scans best. Hunyuan is the fallback.
Blank / near-blank pages
Hunyuan, DeepSeek, and Qianfan all correctly return empty or near-empty output (0.0% CER). FireRed hallucinates text on blank pages — 158.8% CER means it generated far more text than exists on the page.
Takeaway: if your pipeline processes blank pages (common in batch-scanned documents), avoid FireRed. Use Hunyuan or DeepSeek as blank-page detectors.
The routing decision tree
Based on the per-archetype data above, here is the routing table we use:
| Document type | Best model | Runner-up | Avoid |
| --- | --- | --- | --- |
| Text-first notes | Qianfan | Hunyuan | — |
| Diagram questions | GLM | Qianfan | Hunyuan |
| Formula-heavy | Qianfan | Hunyuan | GLM |
| Table-heavy | Qianfan | GLM | Hunyuan |
| Worksheets | Qianfan | FireRed | DeepSeek |
| Low-contrast scans | Qianfan | Hunyuan | DeepSeek |
| Blank pages | Hunyuan or DeepSeek | Qianfan | FireRed |
| Mixed (unknown type) | Route by archetype | Qianfan as fallback | — |
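The routing table translates directly into code. A minimal dispatch sketch follows; the lowercase model names are illustrative labels, not actual API identifiers:

```python
# Routing table from the benchmark: archetype -> (best model, runner-up).
ROUTES = {
    "text_first_notes":           ("qianfan", "hunyuan"),
    "diagram_question":           ("glm",     "qianfan"),
    "formula_heavy":              ("qianfan", "hunyuan"),
    "table_heavy":                ("qianfan", "glm"),
    "worksheet_options":          ("qianfan", "firered"),
    "low_contrast_or_faint_scan": ("qianfan", "hunyuan"),
    "blank_or_near_blank":        ("hunyuan", "qianfan"),
}

# Models with systematic failures on an archetype; never dispatch these.
AVOID = {
    "diagram_question":           {"hunyuan"},
    "formula_heavy":              {"glm"},
    "table_heavy":                {"hunyuan"},
    "worksheet_options":          {"deepseek"},
    "low_contrast_or_faint_scan": {"deepseek"},
    "blank_or_near_blank":        {"firered"},
}

def pick_model(archetype: str) -> str:
    """Best model for a known archetype; Qianfan for unknown types."""
    best, _ = ROUTES.get(archetype, ("qianfan", "glm"))
    return best
```

In practice the dispatch layer also consults the avoid set, falling back to the runner-up when the best model is unavailable or rate-limited.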
The "Avoid" column is not theoretical. GLM on formulas hallucinates. FireRed on blank pages hallucinates. DeepSeek on degraded scans returns garbage. These are not edge cases — they are systematic failures that a routing rule prevents.
Processing speed comparison
| Model | Latency (s/page) | Relative speed |
| --- | --- | --- |
| GLM | 0.9 | 1x (baseline) |
| FireRed | 3.4 | 3.8x slower |
| Hunyuan | 6.6 | 7.3x slower |
| DeepSeek | 14.8 | 16.4x slower |
GLM at 0.9 seconds per page is 16.4x faster than DeepSeek at 14.8 seconds. Qianfan latency data was not available for this benchmark run.
The speed/accuracy tradeoff varies by archetype. GLM is fast and best on diagrams — but worst on formulas. If your document mix is diagram-heavy, GLM gives you both speed and accuracy. If your mix is formula-heavy, the fastest accurate option is Qianfan.
For batch processing pipelines where latency matters less than accuracy, optimise for CER. For real-time or interactive use cases, GLM's speed advantage is significant.
How to build a routing pipeline
The routing table above assumes you know the document type before calling OCR. In practice, you need a classifier upstream.
Option 1: histogram-based classifier. Compute image-level features — text density, line spacing, presence of large non-text regions — and classify into archetypes with simple heuristics. Fast, no GPU required, works for coarse routing.
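Option 1 can be prototyped in a few lines. The features and thresholds below are illustrative placeholders, not tuned values:

```python
import numpy as np

def page_features(gray: np.ndarray) -> dict:
    """Cheap image-level features for coarse archetype routing.
    `gray` is a 2-D uint8 grayscale page image."""
    ink = gray < 128                      # dark pixels ~ text/ink
    return {
        "ink_density": float(ink.mean()),              # overall ink coverage
        "text_row_fraction": float((ink.mean(axis=1) > 0.02).mean()),
    }

def coarse_archetype(gray: np.ndarray) -> str:
    f = page_features(gray)
    if f["ink_density"] < 0.005:
        return "blank_or_near_blank"      # catch blanks before OCR
    if f["text_row_fraction"] > 0.6:
        return "text_first_notes"         # mostly uniform lines of text
    return "unknown"                      # defer to the fallback route
```

A real classifier would add skew, line-spacing, and large-region features, but even this two-feature version keeps blank pages away from FireRed.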
Option 2: lightweight vision model. Run a small vision model (or the first few layers of a larger one) to classify the page archetype. More accurate than histograms, but adds latency and cost.
Option 3: two-pass OCR. Run a fast model (GLM at 0.9s/page) first, then decide based on the output whether to re-run with a more accurate model. For example: if GLM output contains LaTeX-like sequences, re-run with Qianfan.
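The two-pass trigger from option 3 can be as simple as a regular expression over the fast model's output. The pattern below is a sketch; it is illustrative, not exhaustive:

```python
import re

# Heuristic signals of formula content: backslash commands,
# sub/superscript braces, or inline $...$ math.
LATEX_HINT = re.compile(r"\\[a-zA-Z]+|\^\{|_\{|\$[^$]+\$")

def needs_rerun(fast_output: str) -> bool:
    """Escalate from the fast model to the accurate one when the
    page looks formula-heavy."""
    return bool(LATEX_HINT.search(fast_output))
```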
The instavar.com OCR router uses a variant of option 1 combined with archetype-specific confidence thresholds. For the implementation details and how this connects to the broader pipeline, see the hub page.
FAQ
Which single model should I use if I can only run one?
Qianfan. It has the lowest aggregate CER (12.8%) and wins 5 out of 7 archetypes. Its only weakness is diagrams, where GLM is 3.6x better. If your document mix includes few diagrams, Qianfan is the safe default.
Is Tesseract still relevant?
For clean printed text with simple layouts, Tesseract is still functional and free. For scanned documents with complex layouts, degraded quality, formulas, or tables, Tesseract falls behind the models tested here by a wide margin. If you are building a new pipeline in 2026, start with one of the models above.
What about commercial OCR APIs like Mistral OCR 3 or Reducto?
They are viable alternatives if you do not want to self-host. We did not include them in this benchmark because the focus was on open models you can run on your own infrastructure. A commercial API comparison is a separate evaluation with different constraints (cost per page, data residency, rate limits).
How do I evaluate OCR quality on my own documents?
Compute CER and WER against a reference. If you have human-transcribed ground truth, use that. If you do not, use cross-model consensus — run multiple models on the same page and use the highest-agreement output as the reference. Supplement with qualitative spot-checking on edge cases (formulas, tables, degraded scans). The /ocr-review tool we built does exactly this.
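Picking the consensus reference can be done cheaply, assuming you have each model's output as a string. Here `difflib`'s similarity ratio stands in for a proper character alignment; it is a rough proxy, adequate for choosing a reference:

```python
import difflib

def consensus_reference(outputs: dict) -> str:
    """Return the model output that agrees most, on average,
    with all the other models' outputs for the same page."""
    def avg_agreement(name: str) -> float:
        others = [t for n, t in outputs.items() if n != name]
        return sum(difflib.SequenceMatcher(None, outputs[name], o).ratio()
                   for o in others) / len(others)
    return outputs[max(outputs, key=avg_agreement)]
```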
Can I combine multiple OCR models?
Yes — that is the entire point of this article. Route by document archetype to the model that performs best on that type. The routing table above is the decision input. The engineering cost is a page classifier plus model dispatch logic; the accuracy gain is substantial.
Why is CER above 100% in some cells?
CER measures the edit distance between the OCR output and the reference, normalised by the reference length. A CER above 100% means the model produced more erroneous characters than the reference contains — typically because it hallucinated text that does not exist on the page. GLM at 108.6% on formulas and FireRed at 158.8% on blank pages are both hallucination failures.