How We Benchmark OCR Models on Scan-Heavy PDFs


12 Mar 2026, 00:00 Z

Start here

Update (Mar 2026):
This post still documents the original 31-PDF scan-heavy pilot and why the first practical routing rule emerged from that audit.
The public OCR benchmark story has moved on since then: the newer workflow-boundary comparison now includes Hunyuan and DeepSeek alongside GLM and FireRed.
At that newer workflow boundary, Hunyuan is the strongest grounded workflow, DeepSeek is the second-strongest grounded workflow and the strongest blank-page detector, FireRed remains the best balanced workflow, and GLM remains the fastest typical workflow.
Use this article for the method and the limits, then use the updated market-map and workflow-fit posts for the current shortlist and deployment answer:

  • https://instavar.com/blog/ai-production-stack/OCR_SOTA_Feb_2026_Open_Document_AI_Leaderboard
  • https://instavar.com/blog/ai-production-stack/Which_OCR_Model_Fits_Which_Workflow_in_2026

Short version

This page is for the method, the evidence, and the limits.

Trust basis: the benchmark ran self-hosted on a single RTX 3090 Ti 24 GB box, the raw outputs were kept, the harness was versioned as it changed, and every page in the corpus was reviewed at contact-sheet scale before disputed pages were re-checked at higher zoom.

The next section gives the gist in under a minute. The later sections move into the method, evidence, and appendices.

If you only have 1 minute

What we tested:

  • 31 scanned chemistry PDFs
  • 1331 pages
  • GLM-OCR, dots.ocr-1.5, MonkeyOCR, PaddleOCR PP-StructureV3, and FireRed-OCR
  • one self-hosted workflow on a single RTX 3090 Ti 24 GB GPU
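The trust basis above rests on two habits: every raw model output is kept on disk, and every record is stamped with the harness version so runs stay comparable as the harness evolves. A minimal sketch of that loop, assuming a per-page OCR call (`run_model` here is a placeholder, not any model's real API, and the record layout is illustrative):

```python
import hashlib
import json
import pathlib

HARNESS_VERSION = "0.1.0"  # bumped whenever the harness changes


def run_model(model_name: str, page_bytes: bytes) -> str:
    """Placeholder for a real OCR call (GLM-OCR, FireRed-OCR, ...)."""
    return f"<text from {model_name}>"


def benchmark(pages: dict[str, bytes], models: list[str],
              out_dir: pathlib.Path) -> list[dict]:
    """Run every model on every page; keep the raw output plus provenance."""
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for page_id, page_bytes in pages.items():
        for model in models:
            record = {
                "harness_version": HARNESS_VERSION,
                "model": model,
                "page_id": page_id,
                # hash of the page image, so a record can be tied
                # back to the exact input it was produced from
                "page_sha256": hashlib.sha256(page_bytes).hexdigest(),
                "raw_output": run_model(model, page_bytes),
            }
            # one JSON file per (page, model) pair: the "raw outputs were kept"
            (out_dir / f"{page_id}_{model}.json").write_text(json.dumps(record))
            records.append(record)
    return records
```

With 1331 pages and five models this produces 6655 records, which is what makes a later re-check of disputed pages possible without re-running anything.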
