Extract text from scanned PDFs and images using OCR.
Input quality is the single biggest factor. Clean 300 DPI scans of printed text in Times-Roman-style fonts reach ~98% accuracy. Phone photos of wavy paper under poor light, or handwritten cursive, drop to 60–80%. Two simple tricks: (1) flatten the paper and light it evenly, because an even, well-lit shot beats a high-megapixel one with shadows; (2) crop tightly so the OCR engine does not waste time on desk texture. Hindi and other non-Latin scripts benefit especially from cropping out the page edges.
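To judge whether a photo comes close to the 300 DPI figure, divide the page's width in pixels (after cropping) by the paper's physical width in inches. The sketch below is only an illustration of that arithmetic: the helper names `effectiveDpi` and `meetsScanTarget` are hypothetical and not part of this tool or of Tesseract.js, and it assumes the crop spans the full width of the page.

```typescript
// Standard paper widths in inches (A4 is 210 mm, about 8.27 in).
const PAPER_WIDTH_INCHES = { a4: 8.27, letter: 8.5 } as const;

type PaperSize = keyof typeof PAPER_WIDTH_INCHES;

// Effective DPI = pixels across the page / physical page width in inches.
// Assumes pageWidthPx covers the full page width (a tight crop).
function effectiveDpi(pageWidthPx: number, paper: PaperSize): number {
  return pageWidthPx / PAPER_WIDTH_INCHES[paper];
}

// Hypothetical helper: checks an image against the 300 DPI target.
function meetsScanTarget(
  pageWidthPx: number,
  paper: PaperSize,
  targetDpi: number = 300,
): boolean {
  return effectiveDpi(pageWidthPx, paper) >= targetDpi;
}

// A 2550 px wide crop of a US Letter page works out to 300 DPI.
console.log(effectiveDpi(2550, "letter")); // 300
// A 1200 px wide crop of an A4 page is only about 145 DPI.
console.log(meetsScanTarget(1200, "a4")); // false
```

A photo that fails this check is usually better retaken closer to the page than upscaled, since upscaling adds pixels but no detail.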
Yes — Tesseract.js loads the language model from a CDN the first time you run it, then does all the pixel-to-text work in a WebAssembly worker inside your browser. Your pages are never sent anywhere.
Printed text only. Tesseract is trained on typeset fonts, which covers the overwhelming majority of scanned documents: invoices, receipts, books, tax papers, printed forms, bank statements. Handwriting recognition needs a different class of model with its own training pipeline. Leaving it out is what lets this OCR run entirely inside your browser without sending your scans anywhere for processing.
Roughly 5–15 seconds per page on a mid-range laptop, slower on phones. At that rate a 50-page document takes about 4 to 13 minutes, so leave the tab open.
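The per-page range scales linearly with page count, so a whole-document estimate is a single multiplication. The following is a minimal sketch of that calculation using the laptop figures quoted above; `estimateOcrTime` and `formatMinutes` are hypothetical names, and real timings will vary with the device and the page content.

```typescript
// Per-page timing range quoted for a mid-range laptop, in seconds.
const SECONDS_PER_PAGE = { min: 5, max: 15 };

interface TimeEstimate {
  minSeconds: number;
  maxSeconds: number;
}

// Multiplies the per-page range by the number of pages.
function estimateOcrTime(pageCount: number): TimeEstimate {
  if (pageCount < 0 || Math.floor(pageCount) !== pageCount) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    minSeconds: pageCount * SECONDS_PER_PAGE.min,
    maxSeconds: pageCount * SECONDS_PER_PAGE.max,
  };
}

// Formats a duration in seconds as minutes with one decimal place.
function formatMinutes(seconds: number): string {
  return (seconds / 60).toFixed(1) + " min";
}

const fiftyPages = estimateOcrTime(50);
console.log(
  formatMinutes(fiftyPages.minSeconds) + " to " + formatMinutes(fiftyPages.maxSeconds),
); // 4.2 min to 12.5 min
```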
Yes. Pick a bilingual combination from the language picker. English + Hindi, English + Tamil, English + Bengali, English + Telugu and English + Marathi are wired up for Indian government and legal documents that mix English with a regional script. The OCR engine reads both scripts at once with no accuracy penalty on either side. For other combinations, OCR each page separately with the language that matches it.
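Tesseract identifies languages by short codes and accepts several at once when they are joined with `+`, for example `eng+hin`. The sketch below shows how a picker selection could be turned into that string. The mapping table and the `toTesseractLangs` helper are illustrative assumptions, not this tool's actual code; the codes themselves are the standard Tesseract language-data names.

```typescript
// Tesseract language-data codes for the languages named above.
const LANGUAGE_CODES = {
  English: "eng",
  Hindi: "hin",
  Tamil: "tam",
  Bengali: "ben",
  Telugu: "tel",
  Marathi: "mar",
} as const;

type LanguageName = keyof typeof LANGUAGE_CODES;

// Hypothetical helper: joins the selected languages into a
// "+"-separated string such as "eng+hin".
function toTesseractLangs(languages: LanguageName[]): string {
  const codes = languages.map((language) => LANGUAGE_CODES[language]);
  // Drop duplicates while keeping the order the user picked.
  return codes
    .filter((code, index) => codes.indexOf(code) === index)
    .join("+");
}

console.log(toTesseractLangs(["English", "Hindi"])); // eng+hin
console.log(toTesseractLangs(["English", "Tamil"])); // eng+tam
```

The resulting string is the kind of value that can be passed as the language argument when creating a Tesseract.js worker.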
Privacy: OCR is CPU-heavy but entirely local. The Tesseract language model is downloaded once and cached. Your scanned pages, the recognised text, and the exported TXT file all stay on your device.