Most OCR comparisons still start with the benchmark table. The harder production question is simpler: which model breaks least often on the pages you actually have?

This guide is organised around that question. In our scan-heavy OCR pilot, the useful conclusion was not one universal winner. It was a routing rule:

- FireRed-OCR became the best default for text-first pages once its wrapper handled blank pages and preserved page images
- GLM-OCR stayed safer when the question depends on a small inline graph, apparatus, particle diagram, or reaction scheme
- dots.ocr-1.5 was more compelling when OCR was only one part of a broader visual parsing workflow
- PaddleOCR-VL-1.5 stayed relevant when a team wanted a mature OCR baseline tied to a broader parsing ecosystem
**Update (Mar 2026):** The newer full-50 workflow benchmark widened the practical ranking beyond the original FireRed versus GLM routing story. Hunyuan is now the strongest grounded workflow; DeepSeek is the second grounded workflow and the only one to detect all 3/3 blank pages in the current full-50 run; FireRed remains the best balanced workflow; and GLM remains the fastest normal-case workflow. Qianfan is now a promoted workflow and belongs in the routing map as the markdown-oriented fallback lane. A page-level router across all five promoted workflows (FireRed, GLM, Hunyuan, DeepSeek, Qianfan) is operational and under active iteration, but not yet promoted as a default; see Section 10 for the early benchmark results. That means the deployment answer is now a five-lane map, not just a single FireRed/GLM split.
- choose FireRed-OCR when the page is mostly notes, bullets, tables, worked answers, or formulas
- choose GLM-OCR when the page is really asking the reader to interpret a small local visual
- choose HunyuanOCR when grounded coordinate-rich output matters more than latency
- choose DeepSeek-OCR-2 when you want stronger grounding than GLM or FireRed and need better blank-page handling
- choose Qianfan when you want a clean markdown-first OCR path without a grounding requirement
- choose dots.ocr-1.5 when OCR is only one piece of a larger visual-language parsing workflow
- keep PaddleOCR-VL-1.5 in the shortlist if you want a mature baseline with a broader surrounding document stack
- choose Mistral OCR 3 or Reducto when you want a managed API and don't want to run inference infrastructure (see Section 9.5)
## 1 The shortest workflow-fit answer

| Your workflow bottleneck | Best first model to test | Why |
| --- | --- | --- |
| Highest grounded workflow output | HunyuanOCR | It is now the strongest grounded workflow in the current full-50 hands-on benchmark |
| Strong grounding plus strict blank-page handling | DeepSeek-OCR-2 | It is the only workflow in the current full-50 run to detect 3/3 blank pages |
| Cleanup cost on text-heavy scans | FireRed-OCR | It became the cleanest Markdown-first default on text-first pages in the patched pilot |
| Diagram-linked question pages | GLM-OCR | It preserved question-local visuals more safely than the other stacks |
| Clean markdown without a grounding requirement | Qianfan | Promoted markdown-oriented workflow, validated at page and document level |
| OCR plus web, screen, scene, or SVG-style parsing | dots.ocr-1.5 | It is the clearest broader parser in this group, not only a document OCR model |
| Mature baseline plus ecosystem depth | PaddleOCR-VL-1.5 | It remains a strong OCR baseline with a wider surrounding parsing ecosystem |
| Mixed PDFs that alternate between notes and figure-heavy worksheet pages | Page-level routing | Different page types favour different models; see Section 10 for the router benchmark |
| Managed API, no GPU infrastructure, fast integration | Mistral OCR 3 | $1–2 per 1,000 pages, no self-hosting needed, no regional restrictions |
| Enterprise extraction with vendor SLA and agentic review | Reducto | a16z-backed, agentic OCR, from $0.015/page |
## 2 What the pilot changed

The scan-heavy pilot did not support a simple “best OCR model in 2026” claim.

What changed:

- early raw comparisons made GLM-OCR look like the safest overall default
- once the FireRed-OCR wrapper stopped hallucinating on near-blank pages and preserved page images, the result shifted materially
- after that patch, FireRed-OCR led 24/31 documents in the final 5-way run, but the remaining GLM-OCR wins were still meaningful because they were diagram-question-heavy

The practical result was that model choice changed with page type.

The newer workflow benchmark also sharpened the ranking at the workflow boundary:

- Hunyuan is now the strongest grounded workflow
- DeepSeek is the second grounded workflow and the strongest blank-page detector
- FireRed remains the best balanced operational choice
- GLM remains the fastest typical workflow
- Qianfan is now a promoted workflow, adding a markdown-oriented lane to the five-workflow map

That does not replace the original routing rule. It makes the routing rule more complete.
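Blank-page handling mattered enough in the pilot to swing the overall ranking. The article does not describe the wrapper's internals, so the following is only a minimal illustrative sketch of what a blank-page gate can look like: a grayscale ink-coverage check, with the function names and the 0.2% threshold being assumptions, not values from the benchmark.

```python
def ink_fraction(gray_pixels, threshold=200):
    """Fraction of pixels darker than `threshold` (0 = black, 255 = white)."""
    dark = sum(1 for p in gray_pixels if p < threshold)
    return dark / len(gray_pixels)

def is_near_blank(gray_pixels, max_ink=0.002):
    """Gate a page as near-blank when under 0.2% of its pixels carry ink."""
    return ink_fraction(gray_pixels) < max_ink

# A fully white page versus a page with a band of text-like dark pixels.
blank_page = [255] * 10_000
text_page = [255] * 9_500 + [30] * 500
```

A gate like this runs before the model is invoked, so a near-blank page never gets the chance to trigger hallucinated output. Real pipelines would compute the pixel array from a rendered page image and tune the threshold against their own scans.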
## 3 When to use HunyuanOCR

Use HunyuanOCR first when grounded structured output matters more than speed.

Good fit:

- extraction-heavy workflows
- audit-heavy OCR review
- pipelines that need dense page-coordinate grounding
- cases where a human or downstream system needs to trace text back to page regions reliably

Why it works:

- it is now the strongest grounded workflow in the current full-50 hands-on benchmark
- it preserved far more structured anchors than GLM or FireRed
- it is the clearest answer when the workflow needs grounded OCR rather than just readable markdown

Watch-outs:

- it is slower than GLM and FireRed
- the raw output usually needs more normalization before it becomes pleasant to read
## 4 When to use DeepSeek-OCR-2

Use DeepSeek-OCR-2 first when you want stronger grounding than GLM or FireRed, but do not need to beat Hunyuan.

Good fit:

- OCR lanes where blank-page detection has to be strict
- workflows that benefit from grounded output but do not need the densest possible coordinate stream
- teams that want a second grounded option instead of relying on one model family only

Why it works:

- it is now the second grounded workflow in the current full-50 hands-on benchmark
- it was the only workflow in that run to detect all 3/3 blank pages
- it preserved materially more grounding than GLM or FireRed

Watch-outs:

- it is the slowest workflow in the measured set
- the helper-driven workflow path adds startup overhead, which inflates its measured latency relative to the service-backed lanes
## 5 When to use FireRed-OCR

Use FireRed-OCR first when the expensive failure mode is structural cleanup on text-first pages.

Good fit:

- revision notes
- bullet-heavy teaching pages
- answer keys
- formula-heavy explanations
- tables that still need to read linearly in Markdown

Why it works:

- the structural OCR focus is real, not just branding
- once blank-page handling and page-image preservation were fixed, its text-first output was often cleaner than the compact OCR baselines
- it is the better default when the page is mostly meant to be read top-to-bottom

Watch-outs:

- pages where the answer depends on a small local visual instead of the surrounding prose
- worksheet pages with many inline answer-option figures
- pages where a reaction scheme or apparatus needs to stay tied to a specific nearby sentence

FireRed-OCR is the best first test when the main cost is messy Markdown, not missing diagrams.
## 6 When to use GLM-OCR

Use GLM-OCR first when the page is visually local and the text depends on that locality.

Good fit:

- question pages that say “the diagram below”
- apparatus-linked practical questions
- particle-box choices
- reaction-network questions
- small graphs embedded inside a worksheet question

Why it works:

- it preserved inline regions more safely than the other models in the pilot
- it remained the safer choice on diagram-question-heavy documents even after the FireRed-OCR pipeline improved
- the production issue here is not only recognition accuracy but keeping the right visual tied to the right question

Watch-outs:

- long text-heavy notes, where its raw Markdown tends to be noisier than the best FireRed-OCR output
- pages where the main job is reading prose, tables, or answers, not preserving local diagrams

If the question breaks once the inline figure disappears, GLM-OCR should stay in the routing path.
## 7 When to use dots.ocr-1.5

Use dots.ocr-1.5 when your real requirement is broader than document OCR.

Good fit:

- OCR plus web page parsing
- OCR plus screen parsing
- OCR plus scene text
- workflows where SVG-like structure or non-document visual parsing also matters

Why it works:

- it is positioned more clearly as a broader visual parser than as a narrow document OCR specialist
- it deserves a slot when one stack may need to cover several parsing modes, not just scanned PDFs

Watch-outs:

- avoid it as the default OCR engine for scan-heavy school notes or worksheets
- avoid it when your hardest pages are text-heavy PDFs and your main cost is Markdown cleanup

In the internal pilot, dots.ocr-1.5 won only 2/31 documents in both the raw 3-way and the patched 5-way comparisons. That does not make it unimportant; it means its main value is different from the main value of GLM-OCR or FireRed-OCR.
## 8 When to keep PaddleOCR-VL-1.5 in the shortlist

Keep PaddleOCR-VL-1.5 in the shortlist when you want a strong OCR baseline with a broader ecosystem around it.

Good fit:

- teams that care about ecosystem maturity as much as one model score
- document pipelines that may expand into broader modular parsing workflows
- OCR teams that want a serious baseline even if it is not the final default for every page type

Why it still belongs:

- it is still one of the strongest public OCR baselines in this generation
- it has a fuller surrounding ecosystem than the newer challengers
- it is useful when the model decision is really a stack decision

Watch-outs:

- in this specific Markdown-first, scan-heavy comparison, it was not the main story once FireRed-OCR was patched and GLM-OCR was already present as the diagram-safe baseline
## 9 When to use Qianfan

Use Qianfan when you want a workflow-validated markdown-first OCR path without a grounding requirement.

Good fit:

- text-first pages where clean, readable markdown matters more than coordinate-rich output
- pipelines where the grounding-heavy lane is already covered by Hunyuan or DeepSeek and you need a lighter fallback
- teams building a multi-lane router that needs a validated markdown-oriented option alongside the service-backed workflows
- cases where FireRed is occupied or over-committed and a second text-first lane adds throughput headroom

Why it belongs:

- it is now a promoted workflow in the validated set alongside FireRed, GLM, Hunyuan, and DeepSeek
- its workflow path is tested at both page and document level
- its markdown output is clean and relatively consistent on text-first pages

Watch-outs:

- it does not beat FireRed on balanced text-first pages, so do not prefer it when FireRed is available and the page is clearly text-first
- grounding is sparse; it is not an alternative when coordinate-rich anchors matter
- it is slower than GLM and FireRed at the median

Qianfan earns its place as the markdown-oriented fallback lane in a multi-workflow router, not as the primary choice for any single page type.
## 9.5 When to use a commercial OCR API instead

The models above are all open-source or open-weight. If you want a managed API and can accept the pricing, two commercial options are worth evaluating alongside your open-source shortlist.

### Mistral OCR 3

Product: managed API (`mistral-ocr-2512`). No model weights available.

Pricing: $2 per 1,000 pages ($1 in batch mode).

License: Apache 2.0 for the API usage terms. No regional restrictions (unlike some Chinese-origin models).

Good fit:

- teams that want an API call and don't want to run inference infrastructure
- workflows where $1–2 per 1,000 pages is acceptable and you need fast integration
- cases where regional compliance matters (Mistral has no PRC-origin licensing concerns)

Watch-outs:

- no model weights and no self-hosting option; you are locked to the API
- no public evaluation methodology or benchmark dataset from Mistral
- no open-source ecosystem; all coverage is third-party (VentureBeat, InfoQ, PyImageSearch)
- if you want to evaluate Mistral OCR 3 against your own corpus, you pay per page during evaluation

### Reducto

Product: Parse API plus Extract, Split, and Edit endpoints. Agentic OCR that reviews and corrects outputs in real time.

Pricing: from $0.015/page with volume discounts.

Funding: $108M total ($75M Series B from a16z, Feb 2026).

Good fit:

- enterprise document intelligence pipelines with structured extraction requirements
- teams that value vendor stability (well-funded, a16z-backed)
- workflows where the API needs to do more than OCR: extraction, splitting, and editing in one call

Watch-outs:

- no open-source core product; you are locked to the API
- the blog is case-study oriented, with no technical architecture content
- no community engagement (no Discord, no GitHub for the core product)
- they released RolmOCR (a fine-tuned olmOCR variant) as an open-source contribution, which shows ecosystem engagement but does not make the core product open
### How to position commercial APIs in a routing decision

| Decision factor | Favour open-source | Favour commercial API |
| --- | --- | --- |
| Cost sensitivity | ✅ Free inference once deployed | ❌ Per-page pricing adds up at scale |
| Infrastructure team available | ✅ Self-host on your own GPUs | ❌ API eliminates infra work |
| Corpus-specific tuning needed | ✅ Can fine-tune or patch wrappers | ❌ No fine-tuning on commercial APIs |
| Compliance / data residency | ✅ Data never leaves your infrastructure | ❌ Data sent to third-party servers |
| Speed to first result | ❌ Setup, model download, GPU provisioning | ✅ API call in minutes |
| Production SLA required | ❌ Self-managed uptime | ✅ Vendor SLA (check terms) |
For most teams doing the kind of scan-heavy OCR we benchmark here, the open-source routing approach wins because the per-page cost advantage compounds at volume and the ability to patch wrappers (as we did with FireRed blank-page handling) is critical for production quality. Commercial APIs are worth evaluating when you need speed to first result and don't have GPU infrastructure.
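A small break-even sketch makes the compounding argument concrete. The API price below comes from the quoted $1–2 per 1,000 pages; the self-hosted throughput and GPU hourly rate are placeholder assumptions, not benchmark numbers, so substitute your own measurements before drawing conclusions.

```python
def api_cost(pages, price_per_1k=2.0):
    """Managed-API cost at the quoted $1-2 per 1,000 pages (default: $2)."""
    return pages / 1_000 * price_per_1k

def self_host_cost(pages, pages_per_gpu_hour=2_000, gpu_hourly=1.50):
    """Self-hosted GPU cost; throughput and hourly rate are assumed values."""
    return pages / pages_per_gpu_hour * gpu_hourly

# Under these assumptions, 1M pages/month costs $2,000 via the API
# but $750 self-hosted -- the gap widens linearly with volume.
monthly_api = api_cost(1_000_000)
monthly_self = self_host_cost(1_000_000)
```

At small volumes the ordering can flip, because the self-hosted lane also carries fixed setup and provisioning costs that this sketch deliberately omits.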
## 10 Router v2: page-level routing across the promoted set

The manual lane map above works when you know the page type in advance. For mixed corpora where page types vary unpredictably, a page-level router that selects the backend automatically is the practical answer.

We are actively iterating on a page-level router across FireRed, GLM, Hunyuan, DeepSeek, and Qianfan. Current route policy:

- clean text-first pages → FireRed
- diagram-dependent or bbox-sensitive pages → GLM
- structured dense tables or formula-sparse layout pages → Hunyuan
- low-contrast or blank-suspect pages → DeepSeek
- markdown-oriented text-first fallback → Qianfan
- near-blank pages → blank gate without model invocation
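The route policy above can be sketched as a dispatch function. The feature names, thresholds, and the `firered_available` fallback switch are illustrative assumptions for the sketch; the actual router's classifier and cut-offs are not published here.

```python
from dataclasses import dataclass

@dataclass
class PageFeatures:
    ink_fraction: float    # share of pixels carrying ink
    diagram_score: float   # likelihood the page depends on a local visual
    table_density: float   # density of structured tables on the page
    contrast: float        # low values suggest a degraded, blank-suspect scan

def route(page: PageFeatures, firered_available: bool = True) -> str:
    """Dispatch a page to a backend lane, mirroring the policy above."""
    if page.ink_fraction < 0.002:
        return "blank-gate"   # near-blank: skip model invocation entirely
    if page.contrast < 0.3:
        return "DeepSeek"     # low-contrast or blank-suspect pages
    if page.diagram_score > 0.5:
        return "GLM"          # diagram-dependent or bbox-sensitive pages
    if page.table_density > 0.5:
        return "Hunyuan"      # structured dense tables
    # clean text-first default, with Qianfan as the markdown fallback lane
    return "FireRed" if firered_available else "Qianfan"
```

The ordering of the checks is itself a policy decision: putting the blank gate and the low-contrast lane first means degraded pages never reach the lighter text-first backends, which is exactly the over-selection failure the 50-page slice exposed.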
Early benchmark on a 10-page chemistry mixed slice:

| Workflow | Text artifact score | Visual anchors | Expected anchor matches | Blank passes |
| --- | --- | --- | --- | --- |
| FireRed | 51 | 0 | 0/13 | 0/1 |
| GLM | 4 | 12 | 0/13 | 1/1 |
| Hunyuan | 0 | 303 | 13/13 | 1/1 |
| Qianfan | 5 | 14 | 0/13 | 1/1 |
| Router v2 | 28 | 179 | 6/13 | 1/1 |

Mean latency on that slice: 8.77 s/page (backend mix: FireRed 3, GLM 2, Hunyuan 3, DeepSeek 1, blank gate 1).
What that shows: the routed slice beats FireRed, GLM, and Qianfan on grounded coverage for this chemistry page mix. It still trails Hunyuan on pure grounded accuracy (6/13 vs 13/13 expected anchor matches). On a larger 50-page slice, the router over-selected the lighter service-backed lanes (FireRed + GLM on 40 of 50 pages), which capped the quality ceiling and collapsed anchor matches.
The router is operational infrastructure and is actively improving. It is not a promoted default. Use the manual routing rule in Section 11 for production decisions until the routing policy closes more of the quality gap.
## 11 The practical routing rule

If you only want one routing policy from this article, use this:

- If the workflow needs dense grounded structure for extraction or audit, start with HunyuanOCR.
- If you need strong grounding and the cleanest blank-page handling, but can tolerate the slowest workflow in the set, evaluate DeepSeek-OCR-2.
- If the page is mostly notes, bullets, tables, answers, or formulas, start with FireRed-OCR.
- If the page depends on a small inline graph, apparatus, particle diagram, or reaction scheme, route it to GLM-OCR.
- If you want a markdown-oriented fallback lane without a grounding requirement, use Qianfan.
- If the workflow needs OCR plus broader web, screen, scene, or SVG-style parsing, evaluate dots.ocr-1.5 as a separate lane.
- Keep PaddleOCR-VL-1.5 as a mature baseline when the surrounding parsing ecosystem matters, not just one page-level outcome.
That routing rule is more useful in production than arguing about a single universal winner.
## 12 What this means for mixed documents

Some PDFs should not be assigned to one model end to end.

That was the clearest lesson from the mixed chemistry worksheet packs in the pilot:

- text-heavy note pages often favoured FireRed-OCR
- diagram-question pages often favoured GLM-OCR
- grounded-output-heavy workflows now point more clearly to HunyuanOCR
- DeepSeek-OCR-2 now deserves its own grounded lane when blank-page handling matters
- Qianfan adds a validated markdown-oriented lane when FireRed is already committed elsewhere
- broader visual parsing questions still justified a separate dots.ocr-1.5 lane

If your corpus mixes several of these, treat routing as part of the product. The early router benchmark in Section 10 shows where automated page-level routing already helps and where it still needs work. Do not wait to invent routing after rollout.
## 13 Bottom line

The right OCR choice in 2026 depends less on one benchmark headline than on the kind of page that hurts you when it breaks.

- choose HunyuanOCR when grounded structured output matters more than speed
- choose DeepSeek-OCR-2 when you want the second grounded workflow and the strongest blank-page handling
- choose FireRed-OCR when cleanup cost on text-first pages is the bottleneck
- choose GLM-OCR when question-local visuals have to stay attached to the right text
- choose Qianfan when you want a validated markdown-oriented lane without a grounding requirement
- choose dots.ocr-1.5 when OCR is only one part of a broader parser
- keep PaddleOCR-VL-1.5 in the shortlist when ecosystem depth matters

That is a cleaner way to choose a stack than publishing another “best OCR model” table without page semantics.