OCR PDF

Extract text from scanned PDFs and images using OCR.

How to OCR a scanned PDF

  1. Upload a scanned PDF or an image (JPG, PNG, WebP, TIFF). The tool accepts both, so there is no need to convert an image to PDF first.
  2. Pick the language. Hindi, Spanish, French, German, Japanese, Korean, Simplified Chinese, Arabic, and Russian are all supported alongside English.
  3. Click Start OCR. Tesseract (running in a WebAssembly worker) walks each page, recognises the characters, and appends them to the text panel. When it finishes, copy the text to the clipboard or download it as TXT.
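
The page-by-page flow in step 3 can be sketched roughly as follows. This is a minimal TypeScript sketch, not the tool's actual code: `Recognizer`, `ocrPages`, and `onPage` are hypothetical names, and the recognizer is passed in as a parameter so the loop does not depend on any particular OCR library. In a browser it would typically wrap a Tesseract.js worker's `recognize` call.

```typescript
// Hypothetical type: anything that turns one page image into text.
type Recognizer<Page> = (page: Page) => Promise<string>;

// Walk the pages in order, recognise each one, and collect its text.
// `onPage` is an optional hook a UI could use to update a text panel.
async function ocrPages<Page>(
  pages: Page[],
  recognize: Recognizer<Page>,
  onPage?: (pageNumber: number, text: string) => void
): Promise<string> {
  const parts: string[] = [];
  let pageNumber = 0;
  for (const page of pages) {
    pageNumber += 1;
    // Pages are awaited one at a time so output order matches page order.
    const text = (await recognize(page)).trim();
    parts.push(text);
    if (onPage) {
      onPage(pageNumber, text);
    }
  }
  // Separate pages with a blank line in the combined text.
  return parts.join("\n\n");
}
```

Processing pages sequentially keeps memory use predictable and makes it easy to show progress after every page, at the cost of not using more than one worker at a time.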

What affects accuracy

Input quality is the single biggest factor. Clean 300 DPI scans in printed Times-Roman-style fonts reach ~98% accuracy. Phone photos of wavy paper under poor light, or handwritten cursive (which the engine is not trained for), drop to 60–80%. Two simple tricks: (1) flatten the paper and light it evenly, since a flat, well-lit shot beats a high-megapixel one with shadows; (2) crop tightly so the OCR engine does not waste time on desk texture. Hindi and other non-Latin scripts benefit especially from cropping out the page edges.
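
One way to check whether an image is near the 300 DPI mark is to divide its pixel width by the physical width of the page it shows. The sketch below does that arithmetic in TypeScript; `effectiveDpi` and the A4 default are illustrative assumptions, not part of the tool, and the result is only meaningful when the page fills the full width of the image (which is another reason to crop tightly).

```typescript
// A4 paper is 210 mm wide; there are 25.4 mm in an inch.
const A4_WIDTH_INCHES = 210 / 25.4;

// Effective resolution = pixels across the page / inches across the page.
function effectiveDpi(
  pixelWidth: number,
  paperWidthInches: number = A4_WIDTH_INCHES
): number {
  if (pixelWidth <= 0 || paperWidthInches <= 0) {
    throw new RangeError("widths must be positive");
  }
  return pixelWidth / paperWidthInches;
}

// A 2480-pixel-wide image of a full A4 page works out to about 300 DPI.
const a4Dpi = Math.round(effectiveDpi(2480));
```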

Frequently asked

Does OCR run on my device?

Yes — Tesseract.js loads the language model from a CDN the first time you run it, then does all the pixel-to-text work in a WebAssembly worker inside your browser. Your pages are never sent anywhere.

Can it handle handwritten notes?

No, printed text only. Tesseract is trained on typeset fonts, which is what the overwhelming majority of scanned documents actually contain: invoices, receipts, books, tax papers, printed forms, bank statements. Handwriting recognition is a different model class with a different training pipeline, and leaving it out is what lets this OCR run entirely inside your browser without shipping your scans anywhere for processing.

How long does a big PDF take?

Roughly 5–15 seconds per page on a mid-range laptop, slower on phones. At that rate a 50-page document takes about 4 to 13 minutes, so leave the tab open.
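
The per-page range above translates into a total estimate by simple multiplication. The following TypeScript sketch shows the arithmetic; `estimateOcrMinutes` is a hypothetical helper, and the 5 and 15 second bounds are the laptop figures quoted above, so phones will land above the upper bound.

```typescript
// Per-page timing range for a mid-range laptop, in seconds.
const SECONDS_PER_PAGE = { min: 5, max: 15 };

interface OcrEstimate {
  minMinutes: number;
  maxMinutes: number;
}

// Multiply the page count by the per-page bounds and convert to minutes.
function estimateOcrMinutes(pageCount: number): OcrEstimate {
  if (pageCount < 0 || Math.floor(pageCount) !== pageCount) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    minMinutes: (pageCount * SECONDS_PER_PAGE.min) / 60,
    maxMinutes: (pageCount * SECONDS_PER_PAGE.max) / 60,
  };
}

// 50 pages: 250 to 750 seconds, i.e. roughly 4.2 to 12.5 minutes.
const fiftyPages = estimateOcrMinutes(50);
```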

Can I OCR a mixed-language document?

Yes. Pick a bilingual combo from the language picker. English + Hindi, English + Tamil, English + Bengali, English + Telugu, and English + Marathi are wired up for Indian government and legal documents that mix English with a regional script. The OCR engine reads both at once with no accuracy penalty on either side. For other language combinations, OCR each page separately with the language that page uses.
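
Tesseract selects several languages at once by joining their language codes with `+`, for example `eng+hin`. The TypeScript sketch below builds that string for the combinations listed above. The `LANG_CODES` table and `tesseractLangs` function are illustrative; how the tool itself maps picker entries to codes is an assumption.

```typescript
// Tesseract language codes for the languages in the bilingual combos.
const LANG_CODES: Record<string, string> = {
  English: "eng",
  Hindi: "hin",
  Tamil: "tam",
  Bengali: "ben",
  Telugu: "tel",
  Marathi: "mar",
};

// Turn display names into a "+"-joined Tesseract language string.
function tesseractLangs(names: string[]): string {
  const codes = names.map((name) => {
    // Guard against inherited keys such as "constructor".
    if (!Object.prototype.hasOwnProperty.call(LANG_CODES, name)) {
      throw new Error(`unsupported language: ${name}`);
    }
    return LANG_CODES[name];
  });
  return codes.join("+");
}

const englishHindi = tesseractLangs(["English", "Hindi"]); // "eng+hin"
```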

Privacy: OCR is CPU-heavy but entirely local. The Tesseract language model is downloaded once and cached. Your scanned pages, recognised text, and exported TXT all stay on your device.