This guide answers a practical build question: if your PDFs are scanned images, which OCR model should handle each kind of page.
The short version
No single model won all document types in the 50-page benchmark.
Qianfan had the lowest aggregate character error rate, or CER, at 12.8%.
The more useful result is by page type: GLM dominated diagram pages at 6.1% CER, Qianfan swept text, tables, formulas, worksheets, and low-contrast scans, and Hunyuan was the strongest low-contrast fallback at 6.6% CER.
The right production answer is a routing rule, not one default model for every scanned PDF.
The one-minute decision path
Scanned PDFs fail differently from born-digital PDFs because the text is inside page images. A clean notes page, a diagram question, a faint scan, and a blank separator page need different safeguards.
Read this page in three passes:
1. Use the quick routing table below for the first implementation choice.
2. Check the page-type results before trusting the aggregate score.
3. Use the failure notes to decide where to add fallback models or human review.
| If the scanned page is... | Start with... | Why |
|---|---|---|
| text-first notes, tables, formulas, or worksheets | Qianfan | lowest measured error across those page types |
| diagram-heavy or figure-linked | GLM | strongest measured diagram-page result |
| low-contrast or faint | Qianfan, with Hunyuan as fallback | strongest measured low-contrast result, with a grounded fallback |
| blank or near-blank | Hunyuan, DeepSeek, or Qianfan | avoids blank-page hallucination seen in weaker models |
| mixed document | page-type routing | one aggregate winner hides page-specific failures |
Where this fits
For founders: if you are building an OCR pipeline, do not pick one model - route by document type. This page gives you the routing table with the data behind it, so you can skip the bakeoff and ship.
For engineers: the tables below show which model fails least on each page type. Use them to set routing thresholds, pick fallback models, and avoid models that hallucinate on your document mix.
Quick definitions:
- CER means character error rate. Lower is better.
- WER means word error rate. Lower is better.
- Archetype means page type: notes, diagrams, tables, formulas, worksheets, blank pages, or faint scans.
- Routing means sending each page type to the model that handled it best in testing.
Why scanned PDFs are hard
Born-digital PDFs have extractable text layers. You can copy-paste from them, search them, and feed them straight into downstream pipelines.
Scanned PDFs are images. Every page is a raster - OCR must reconstruct text from pixels. That reconstruction fails in predictable ways:
Degraded scans - faded ink, uneven lighting, skewed pages, coffee stains. The model sees noise where you see text.
Complex layouts - multi-column pages, nested tables, sidebars, footnotes. The reading-order problem is as hard as the character-recognition problem.
Formulas - mathematical notation requires spatial reasoning that most OCR models were not trained for. A subscript in the wrong place changes the meaning entirely.
Diagrams with embedded text - flowcharts, circuit diagrams, annotated figures. The model must separate diagram elements from readable text.
Handwritten content - annotations, margin notes, filled-in worksheets. Most models trained on printed text struggle here.
The failure mode is not "OCR returns nothing." The failure mode is OCR returns plausible-looking text that is wrong - and you do not notice until a downstream consumer breaks.
Test methodology
We used the lightonocr-slice-v1 corpus: 50 pages drawn from real scanned PDFs, classified into 7 page types by visual structure.
Corpus breakdown
| Archetype | Pages | What it tests |
|---|---|---|
| text_first_notes | 10 | Clean printed text, minimal layout complexity |
| diagram_question | 10 | Inline diagrams with embedded text labels |
| table_heavy | 8 | Multi-row, multi-column tabular data |
| formula_heavy | 8 | Mathematical notation (LaTeX-level complexity) |
| worksheet_options | 8 | Multiple-choice layouts, numbered items |
| blank_or_near_blank | 3 | Pages with little or no content (false positive test) |
| low_contrast_or_faint_scan | 3 | Degraded, faded, or low-contrast scans |
Models tested
Five open OCR models, each run on every page:
Qianfan (Baidu)
GLM (Zhipu AI)
Hunyuan (Tencent)
FireRed (FireRed AI)
DeepSeek (DeepSeek)
Evaluation method
We computed CER (Character Error Rate) and WER (Word Error Rate) using cross-model consensus as the reference. Where no human ground truth exists, the highest-consensus model output serves as the reference string. This is not a perfect proxy - but it is a practical one that scales to hundreds of pages without manual transcription.
For interactive side-by-side comparison and per-page voting, see the internal /ocr-review tool.
How to read the numbers: a low aggregate score is useful, but it is not enough. A model can look good overall and still fail badly on diagrams or formulas. That is why the document-type breakdown matters more than the headline table.
Results: aggregate comparison
| Model | CER (%) | WER (%) |
|---|---|---|
| Qianfan | 12.8 | 13.18 |
| GLM | 33.84 | 27.59 |
| Hunyuan | 35.5 | 29.3 |
| FireRed | 39.01 | 23.88 |
| DeepSeek | 39.34 | 33.39 |
Qianfan leads by a wide margin on aggregate CER. But aggregate scores hide page-specific performance. GLM, for example, is 3.6x better than Qianfan on diagram pages - a fact invisible in the aggregate table.
The page-type breakdown below is where the routing decisions come from.
Results by document type
CER (%) by page type
| Archetype | Pages | FireRed | GLM | Hunyuan | DeepSeek | Qianfan |
|---|---|---|---|---|---|---|
| text_first_notes | 10 | 10.0 | 20.7 | 8.2 | 8.3 | **5.9** |
| diagram_question | 10 | 39.9 | **6.1** | 65.9 | 30.2 | 22.0 |
| formula_heavy | 8 | 78.7 | 108.6 | 42.5 | 76.6 | **20.7** |
| table_heavy | 8 | 39.7 | 35.6 | 63.6 | 43.8 | **15.7** |
| worksheet_options | 8 | 12.2 | 15.7 | 16.2 | 46.5 | **7.1** |
| low_contrast_or_faint_scan | 3 | 16.3 | 14.2 | 6.6 | 69.1 | **0.0** |
| blank_or_near_blank | 2 | 158.8 | N/A | **0.0** | **0.0** | **0.0** |
Bold marks the best model per page type.
Text-first notes
All models perform reasonably on clean printed text. Qianfan is best at 5.9% CER. Hunyuan (8.2%) and DeepSeek (8.3%) are close behind. GLM lags at 20.7% - acceptable for many use cases, but not best-in-class for this page type.
Takeaway: for straightforward text pages, any model works. Qianfan and Hunyuan are the safest picks.
Diagram questions
GLM dominates at 6.1% CER - 3.6x better than Qianfan (22.0%) and over 10x better than Hunyuan (65.9%). Hunyuan struggles badly with inline diagrams, likely confusing diagram elements with text.
Takeaway: use GLM for any page with inline diagrams, flowcharts, or annotated figures.
Formula-heavy
Qianfan wins at 20.7% CER. Hunyuan is a distant second at 42.5%. GLM is the worst at 108.6% - a CER above 100% means the model hallucinated more characters than exist in the reference. GLM actively fabricates content when it encounters formulas.
Takeaway: use Qianfan for mathematical content. Avoid GLM entirely on formula pages.
Table-heavy
Qianfan again leads at 15.7% CER. GLM (35.6%) and FireRed (39.7%) are in the middle. Hunyuan is worst at 63.6% - it struggles with multi-column alignment and cell boundaries.
Takeaway: Qianfan for tables. GLM is an acceptable runner-up.
Worksheet/options
Qianfan best at 7.1%. FireRed (12.2%) and GLM (15.7%) are reasonable. DeepSeek is worst at 46.5% - it misreads option labels and numbering.
Takeaway: Qianfan or FireRed for multiple-choice and worksheet layouts.
Low-contrast / faint scans
Qianfan achieves 0.0% CER on the low-contrast pages in this corpus. Hunyuan is good at 6.6%. DeepSeek is terrible at 69.1% - it fails to extract legible text from degraded scans.
Takeaway: Qianfan handles degraded scans best. Hunyuan is the fallback.
Blank / near-blank pages
Hunyuan, DeepSeek, and Qianfan all correctly return empty or near-empty output (0.0% CER). FireRed hallucinates text on blank pages - 158.8% CER means it generated far more text than exists on the page.
Takeaway: if your pipeline processes blank pages (common in batch-scanned documents), avoid FireRed. Use Hunyuan or DeepSeek as blank-page detectors.
The routing decision tree
Based on the page-type data above, here is the routing table we use:
| Document type | Best model | Runner-up | Avoid |
|---|---|---|---|
| Text-first notes | Qianfan | Hunyuan | - |
| Diagram questions | GLM | Qianfan | Hunyuan |
| Formula-heavy | Qianfan | Hunyuan | GLM |
| Table-heavy | Qianfan | GLM | Hunyuan |
| Worksheets | Qianfan | FireRed | DeepSeek |
| Low-contrast scans | Qianfan | Hunyuan | DeepSeek |
| Blank pages | Hunyuan or DeepSeek | Qianfan | FireRed |
| Mixed (unknown type) | Route by page type | Qianfan as fallback | - |
The "Avoid" column is not theoretical. GLM on formulas hallucinates. FireRed on blank pages hallucinates. DeepSeek on degraded scans returns garbage. These are not edge cases - they are systematic failures that a routing rule prevents.
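In code, the routing table is just a dispatch map. A minimal sketch - the lowercase model identifiers are placeholders for however you invoke each backend, not real API names:

```python
# Routing table as a dispatch map: page type -> (primary, runner-up).
# Model identifiers are illustrative placeholders.
ROUTES = {
    "text_first_notes":           ("qianfan", "hunyuan"),
    "diagram_question":           ("glm",     "qianfan"),
    "formula_heavy":              ("qianfan", "hunyuan"),
    "table_heavy":                ("qianfan", "glm"),
    "worksheet_options":          ("qianfan", "firered"),
    "low_contrast_or_faint_scan": ("qianfan", "hunyuan"),
    "blank_or_near_blank":        ("hunyuan", "deepseek"),
}

def pick_model(page_type: str, fallback: bool = False) -> str:
    """Return the model to call for a page type; Qianfan is the default
    for unknown types, matching the 'mixed document' row above."""
    primary, runner_up = ROUTES.get(page_type, ("qianfan", "qianfan"))
    return runner_up if fallback else primary

print(pick_model("diagram_question"))  # -> glm
```

The "Avoid" column becomes a guard in dispatch logic: if the primary call fails or times out, fall back to the runner-up rather than an arbitrary third model.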
Processing speed comparison
| Model | Latency (s/page) | Relative speed |
|---|---|---|
| GLM | 0.9 | 1x (baseline) |
| FireRed | 3.4 | 3.8x slower |
| Hunyuan | 6.6 | 7.3x slower |
| DeepSeek | 14.8 | 16.4x slower |
GLM at 0.9 seconds per page is 16x faster than DeepSeek at 14.8 seconds. Qianfan latency data was not available for this benchmark run.
The speed/accuracy tradeoff varies by page type. GLM is fast and best on diagrams - but worst on formulas. If your document mix is diagram-heavy, GLM gives you both speed and accuracy. If your mix is formula-heavy, the fastest accurate option is Qianfan.
For batch processing pipelines where latency matters less than accuracy, optimise for CER. For real-time or interactive use cases, GLM's speed advantage is significant.
How to build a routing pipeline
The routing table above assumes you know the document type before calling OCR. In practice, you need a classifier upstream.
Option 1: histogram-based classifier. Compute image-level features - text density, line spacing, presence of large non-text regions - and classify pages with simple heuristics. Fast, no GPU required, works for coarse routing.
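A minimal sketch of option 1, using only image statistics. The thresholds and the `classify_page` name are invented for illustration and would need tuning on your own corpus; the input is a grayscale page as a 2-D list of 0-255 values:

```python
import statistics

def classify_page(gray):
    """Coarse page-type guess from a grayscale image (0 = black, 255 = white),
    given as a 2-D list of ints. Thresholds are illustrative placeholders."""
    flat = [px for row in gray for px in row]
    # Fraction of "inked" (dark) pixels.
    ink_fraction = sum(px < 128 for px in flat) / len(flat)
    if ink_fraction < 0.005:
        return "blank_or_near_blank"
    # Low overall contrast suggests a faint scan.
    if statistics.pstdev(flat) < 30:
        return "low_contrast_or_faint_scan"
    # Fraction of rows containing ink - a crude proxy for text-line structure.
    inked_rows = sum(any(px < 128 for px in row) for row in gray) / len(gray)
    if inked_rows > 0.6:
        return "text_first_notes"     # dense, regular text lines
    return "diagram_question"         # sparse or irregular ink: figure-like
```

This deliberately collapses tables, formulas, and worksheets into the text bucket; since Qianfan wins all three in the benchmark, the coarse classifier only needs to separate blanks, faint scans, and figure-like pages from the text lane.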
Option 2: lightweight vision model. Run a small vision model (or the first few layers of a larger one) to classify the page type. More accurate than histograms, but adds latency and cost.
Option 3: two-pass OCR. Run a fast model (GLM at 0.9s/page) first, then decide based on the output whether to re-run with a more accurate model. For example: if GLM output contains LaTeX-like sequences, re-run with Qianfan.
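Option 3 reduces to a re-run predicate over the fast pass's output. A sketch, assuming the routing logic from this article; the regex and the `route_second_pass` name are illustrative:

```python
import re
from typing import Optional

# Crude detector for formula-like content in OCR output: LaTeX commands
# or sub/superscript braces. Pattern is an illustrative starting point.
LATEX_LIKE = re.compile(r"\\(frac|sum|int|sqrt|begin\{)|[_^]\{")

def route_second_pass(fast_output: str) -> Optional[str]:
    """Return a model name to re-run with, or None to accept the fast pass."""
    if LATEX_LIKE.search(fast_output):
        return "qianfan"  # formula-like content: the fast lane hallucinates here
    return None           # includes empty output: trust the blank result

print(route_second_pass(r"\int_0^1 f(x) dx"))  # -> qianfan
```

The trade-off: pages that trigger the predicate pay for two OCR calls, so the predicate should be conservative enough that the common case stays single-pass.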
The instavar.com OCR router uses a variant of option 1 combined with page-type-specific confidence thresholds. For the implementation details and how this connects to the broader pipeline, see the hub page.
FAQ
Which single model should I use if I can only run one?
Qianfan. It has the lowest aggregate CER (12.8%) and wins 5 out of 7 page types. Its only weakness is diagrams, where GLM is 3.6x better. If your document mix includes few diagrams, Qianfan is the safe default.
Is Tesseract still relevant?
For clean printed text with simple layouts, Tesseract is still functional and free. For scanned documents with complex layouts, degraded quality, formulas, or tables, Tesseract falls behind the models tested here by a wide margin. If you are building a new pipeline in 2026, start with one of the models above.
What about commercial OCR APIs like Mistral OCR 3 or Reducto?
They are viable alternatives if you do not want to self-host. We did not include them in this benchmark because the focus was on open models you can run on your own infrastructure. A commercial API comparison is a separate evaluation with different constraints (cost per page, data residency, rate limits).
How do I evaluate OCR quality on my own documents?
Compute CER and WER against a reference. If you have human-transcribed ground truth, use that. If you do not, use cross-model consensus - run multiple models on the same page and use the highest-agreement output as the reference. Supplement with qualitative spot-checking on edge cases (formulas, tables, degraded scans). The /ocr-review tool we built does exactly this.
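Both metrics are normalised edit distance - over characters for CER, over whitespace-split tokens for WER. A minimal self-contained implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: char edits / reference length."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: word edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(hypothesis.split(), ref_words) / max(len(ref_words), 1)
```

Note that the denominator is the reference length, which is why hallucination on short or blank pages can push CER past 100%.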
Can I combine multiple OCR models?
Yes - that is the entire point of this article. Route by page type to the model that performs best on that type. The routing table above is the decision input. The engineering cost is a page classifier plus model dispatch logic; the accuracy gain is substantial.
Why is CER above 100% in some cells?
CER measures the edit distance between the OCR output and the reference, normalised by the reference length. A CER above 100% means the model produced more erroneous characters than the reference contains - typically because it hallucinated text that does not exist on the page. GLM at 108.6% on formulas and FireRed at 158.8% on blank pages are both hallucination failures.
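A toy illustration of the arithmetic (the strings are invented, not from the corpus). Because every surplus character in the output must be deleted to match the reference, the edit distance is at least the length difference:

```python
# Near-blank page: the reference is short, the model hallucinates a heading.
reference = "p. 3"                                          # 4 characters
hallucinated = "Chapter 3: Introduction to Thermodynamics"  # 41 characters
# Edit distance >= len(hallucinated) - len(reference) = 37 edits,
# so CER >= 37 / 4 = 925%.
min_cer = (len(hallucinated) - len(reference)) / len(reference)
print(f"CER >= {min_cer:.0%}")  # -> CER >= 925%
```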