Best OCR for Scanned PDFs - 5 Models Tested on 50 Real Documents


28 Mar 2026, 00:00 Z

This guide answers a practical build question: if your PDFs are scanned images, which OCR model should handle each kind of page.

The short version

  • No single model won all document types in the 50-page benchmark.
  • Qianfan had the lowest aggregate character error rate, or CER, at 12.8%.
  • The more useful result is by page type: GLM dominated diagram pages at 6.1% CER, Hunyuan was strong on low-contrast scans at 6.6% CER, and Qianfan swept text, tables, formulas, and worksheets.
  • The right production answer is a routing rule, not one default model for every scanned PDF.
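
For readers who want to reproduce a CER number on their own pages: CER is commonly computed as the character-level edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that standard definition. The benchmark's exact normalization (whitespace, case, punctuation handling) is not stated here, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One dropped character in a seven-character reference.
print(round(cer("scanned", "scaned"), 3))
```

Because the denominator is the reference length, CER can exceed 100% when a model hallucinates long runs of text, which is one reason to compare results per page type and not only in aggregate.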

The one-minute decision path

Scanned PDFs fail differently from born-digital PDFs because the text is inside page images. A clean notes page, a diagram question, a faint scan, and a blank separator page need different safeguards.

Read this page in three passes:

  1. use the quick routing table below for the first implementation choice
  2. check the page-type results before trusting the aggregate score
  3. use the failure notes to decide where to add fallback models or human review

| If the scanned page is... | Start with... | Why |
| --- | --- | --- |
| text-first notes, tables, formulas, or worksheets | Qianfan | lowest measured error across those page types |
| diagram-heavy or figure-linked | GLM | strongest measured diagram-page result |
| low-contrast or faint | Qianfan, with Hunyuan as fallback | |
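
The routing table can be expressed as a small lookup. This is a minimal sketch, not a tested production router. The page-type labels, the lowercase model keys, and the `models_for` function are hypothetical names introduced for illustration, and the sketch assumes a separate step has already classified each page. Falling back to Qianfan for unrecognized page types is also an assumption, based only on its lowest aggregate CER.

```python
from typing import Dict, List

# Ordered model preferences per page type: first entry is the starting
# model, later entries are fallbacks. Labels and keys are hypothetical.
ROUTES: Dict[str, List[str]] = {
    "text": ["qianfan"],
    "table": ["qianfan"],
    "formula": ["qianfan"],
    "worksheet": ["qianfan"],
    "diagram": ["glm"],
    "low_contrast": ["qianfan", "hunyuan"],
}

# Assumption: unknown page types go to the model with the lowest aggregate CER.
DEFAULT_ROUTE: List[str] = ["qianfan"]


def models_for(page_type: str) -> List[str]:
    """Return the ordered list of models to try for a classified page."""
    return list(ROUTES.get(page_type, DEFAULT_ROUTE))


print(models_for("diagram"))
print(models_for("low_contrast"))
```

Keeping the routes in data and out of branching logic makes it easy to add a fallback model or a human-review step for one page type without touching the others.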
