How to OCR a PDF — Extract Text from Scanned Documents for Free

Updated April 8, 2026 · 5 min read

You have a scanned PDF — maybe a contract someone faxed you, a receipt you photographed, or an old document that was digitized years ago. You try to select the text, and nothing happens. You try Ctrl+F to search for a word, and the search comes back empty. The document looks like it contains text, but as far as your computer is concerned, it is just a picture.

This is one of the most common frustrations people encounter when working with PDFs. The document is technically a PDF file, but its pages are flat images with no selectable, searchable, or copyable text underneath. To unlock that text, you need OCR — Optical Character Recognition.

This guide explains what OCR is and how it works, then walks you through three methods for extracting text from scanned PDFs, starting with a completely free, browser-based approach that does not upload your files anywhere.

What Is OCR and Why Do You Need It?

OCR stands for Optical Character Recognition. It is the technology that looks at an image of text — whether that image comes from a scanner, a camera, or a screenshot — and converts it into actual machine-readable text that you can select, copy, search, and edit.

Think of it this way: when you scan a paper document and save it as a PDF, the scanner is essentially taking a photograph. The resulting PDF contains a high-resolution image of each page, not actual text data. Your PDF viewer renders it and it looks perfectly readable to human eyes, but the file contains zero text characters. It is the digital equivalent of a photocopy — visually identical, but fundamentally different from a document you typed on a computer.

You need OCR any time you are dealing with:

Scanned documents — contracts, invoices, tax forms, medical records, or any paperwork that was digitized using a scanner or multifunction printer.
Photographed text — receipts captured with your phone camera, whiteboard photos, screenshots of text from non-copyable sources.
Image-only PDFs — files that look like normal PDFs but were generated from scanned images without any OCR layer applied during scanning.
Old archived documents — legacy documents from the early days of digital archiving when OCR was not standard in scanning workflows.
Protected or restricted PDFs — some PDFs disable text selection through security settings, and OCR on a screenshot can work around that limitation.

Without OCR, these documents are essentially dead weight in a digital workflow. You cannot search them, you cannot copy text from them, you cannot feed them into translation tools, and you cannot extract data from them programmatically. OCR brings them back to life.

How OCR Works (Without the Technical Jargon)

At a high level, OCR follows a straightforward pipeline. The software receives an image — a scanned page, a photo, or a rendered PDF page — and processes it through several stages to produce text output.

Step 1: Image preprocessing. The software cleans up the image. It straightens tilted pages (deskewing), adjusts contrast so text stands out from the background, removes noise like scanner artifacts or paper texture, and converts the image to a format that makes character boundaries clearer. This step has a massive impact on accuracy — a clean, high-contrast image will produce far better results than a dark, blurry photograph.

Step 2: Character segmentation. The engine identifies where individual characters are on the page. It detects lines of text, then breaks each line into words and each word into characters. For Latin scripts this is relatively straightforward. For scripts such as Arabic, Devanagari (used for Hindi), or Chinese, where characters connect or have different spatial relationships, this step is significantly more complex.

Step 3: Pattern recognition. Each isolated character shape is compared against known patterns for the selected language. Modern OCR engines use neural networks trained on millions of text samples, so they can handle a wide variety of fonts, sizes, and printing qualities. The engine assigns a confidence score to each character match — essentially saying "I am 98% sure this is the letter A" or "I am 72% sure this is either a 0 or an O."

Step 4: Post-processing. The raw character output is refined using language models and dictionaries. If the engine recognized "tbe" but the language model knows that "the" is far more likely in that context, it corrects the output. This step catches many individual character errors and significantly improves the final accuracy.
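The "tbe" to "the" correction in Step 4 can be sketched with Python's standard difflib module. This is only an illustration of dictionary-based correction, not how any particular OCR engine implements it: the vocabulary, the function names, and the 0.6 similarity cutoff are all assumptions made for the example, and real engines use far larger dictionaries plus statistical language models.

```python
import difflib

# Tiny stand-in vocabulary (illustrative only; a real engine uses a full
# dictionary and a language model).
VOCABULARY = ["the", "text", "scanned", "document", "contains", "page"]


def correct_word(word: str, cutoff: float = 0.6) -> str:
    """Return the closest vocabulary word, or the input unchanged.

    Note: a corrected word comes back in lowercase, which is good enough
    for this sketch.
    """
    if word.lower() in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line: str) -> str:
    """Apply word-level correction to a whitespace-separated line."""
    return " ".join(correct_word(word) for word in line.split())


# Hypothetical raw recognizer output with two typical misreads:
# "b" for "h", and "rn" for "m".
raw_output = "tbe scanned docurnent contains text"
corrected = correct_line(raw_output)
print(corrected)  # the scanned document contains text
```

Words with no sufficiently similar dictionary entry are left alone, which is why names, codes, and numbers still need proofreading after OCR.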

The result is a block of plain text that corresponds to what appeared in the original image. The quality depends heavily on the input — a 300 DPI scan of a cleanly printed document will produce near-perfect text, while a blurry phone photo of a crumpled receipt might be 70% accurate at best.
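To make Steps 1 and 2 of the pipeline more concrete, here is a toy sketch of two classic techniques: fixed-threshold binarization and line detection by scanning for rows that contain ink. The tiny grayscale grid, the threshold of 128, and the function names are assumptions chosen for illustration. Production engines use adaptive thresholds and much more robust layout analysis.

```python
def binarize(gray: list[list[int]], threshold: int = 128) -> list[list[int]]:
    """Map grayscale pixels (0 = black, 255 = white) to 1 for ink, 0 for paper."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def find_text_lines(binary: list[list[int]]) -> list[tuple[int, int]]:
    """Return (first_row, last_row) for each consecutive run of rows with ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # text runs to the bottom edge
        lines.append((start, len(binary) - 1))
    return lines


# A 7-row "page": two dark bands of text separated by light paper.
page = [
    [250, 250, 250, 250],
    [30, 240, 20, 250],
    [40, 35, 245, 250],
    [250, 250, 250, 250],
    [250, 250, 250, 250],
    [25, 250, 30, 20],
    [250, 250, 250, 250],
]
lines = find_text_lines(binarize(page))
print(lines)  # [(1, 2), (5, 5)]
```

The same idea applied column by column within each detected line gives a rough word and character split, which is where connected scripts become difficult.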

Method 1: AllPDF.tools OCR (Free, Private, Browser-Based)

The fastest way to OCR a scanned PDF is to use AllPDF.tools OCR. It runs entirely in your browser using Tesseract.js — there are no file uploads, no server processing, and no accounts required. Your document never leaves your device.

Step-by-Step Instructions

Open the OCR tool. Go to AllPDF.tools OCR in any modern browser (Chrome, Firefox, Edge, Safari).
Select your file. Click the upload area or drag and drop your scanned PDF. Despite the "upload" label, the file is read locally in your browser and is not sent to a server. The tool also accepts image files directly: JPG, PNG, BMP, WebP, and TIFF are all supported.
Select the language. Choose the language of the text in your document from the dropdown. There are 12 languages available, including English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Chinese (Simplified), Japanese, Korean, and Arabic. Selecting the correct language is critical for accuracy — the OCR engine loads a language-specific trained model.
Click Start OCR. The processing begins. On the very first run, the tool downloads the OCR engine and language data — this is roughly 15 MB depending on the language. This download happens once and is cached by your browser for subsequent uses, so future OCR operations start almost instantly.
Wait for processing. OCR is computationally intensive. A single page typically takes 5 to 30 seconds depending on your device's processing power and the complexity of the page. Multi-page documents take proportionally longer. You will see a progress indicator while it works.
Copy or download the text. Once processing completes, the extracted text appears on screen. You can copy it directly to your clipboard or download it as a plain text file.

Tip: If your scanned PDF has many pages, consider splitting it first using the PDF Split tool and OCR-ing individual sections. This keeps processing times manageable and lets you verify accuracy page by page.
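If you plan to split a long document before OCR, a few lines of arithmetic help with planning. The sketch below divides a page count into fixed-size ranges and estimates total processing time from the 5 to 30 seconds per page quoted above. The function names and the chunk size of 10 are arbitrary choices for the example.

```python
def page_chunks(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def estimated_seconds(pages: int, low: int = 5, high: int = 30) -> tuple[int, int]:
    """Best-case and worst-case processing time, using the per-page range above."""
    return pages * low, pages * high


chunks = page_chunks(23, 10)
print(chunks)                 # [(1, 10), (11, 20), (21, 23)]
print(estimated_seconds(23))  # (115, 690)
```

For a 23-page scan, that works out to roughly 2 to 12 minutes in total, which is a reasonable argument for processing it in three smaller sections.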


Method 2: Google Drive OCR (Free, Requires Google Account)

Google Drive has a little-known OCR feature built into Google Docs. It is not obvious and Google does not advertise it prominently, but it works surprisingly well for straightforward documents.

How to Use Google Drive for OCR

Upload the PDF to Google Drive. Open drive.google.com and upload your scanned PDF file.
Right-click the file and select "Open with" → "Google Docs."
Wait for processing. Google will automatically run OCR on the document and open it as a Google Doc with the extracted text. Images from the original PDF may appear inline with the text.
Copy the text from the Google Doc, or download the document in your preferred format (DOCX, TXT, etc.).

Pros: Free, decent accuracy on printed English text, automatically preserves some formatting, works well for simple single-column documents.

Cons: Requires uploading your document to Google's servers — not ideal for sensitive or confidential files. Struggles with complex layouts, multi-column text, and tables. Limited language support compared to dedicated OCR tools. Cannot handle very large files. No control over OCR settings or parameters.

Tip: Google Drive OCR works best on files under 2 MB with clear, high-contrast text. For larger or more complex documents, a dedicated OCR tool will produce better results.

Method 3: Adobe Acrobat Pro (Paid, Best Quality)

Adobe Acrobat Pro ships a proprietary OCR engine refined over decades, and it is strong on degraded scans and mixed-language archives. For reasonably clean scans (invoices, receipts, printed books, tax documents, form submissions), Tesseract, the open-source engine behind our browser OCR tool, often comes close to paid engines in accuracy, which is one reason it is so widely used. Acrobat earns its price mainly on difficult inputs, complex layouts, and batch workflows.

How to OCR in Adobe Acrobat Pro

Open the scanned PDF in Acrobat Pro.
Go to "Scan & OCR" in the Tools panel (or "Edit PDF" which auto-triggers OCR on scanned files).
Click "Recognize Text" → "In This File."
Select the language, output style (searchable image, editable text, or searchable image exact), and page range.
Click "Recognize Text" and wait for processing.
The text layer is embedded directly into the PDF — you can now select text, search, and copy from the document. Save the file to preserve the OCR layer.

Pros: Best-in-class accuracy. Embeds a text layer into the original PDF (so the PDF becomes searchable while keeping its original appearance). Handles complex multi-column layouts, tables, headers, and footers intelligently. Supports batch OCR on hundreds of files. Excellent for professional and legal workflows.

Cons: Requires a paid subscription (approximately $20-23/month). Desktop software only — requires installation. Processing happens on your local machine, so performance depends on your hardware. Overkill for occasional use.

Tips for Getting the Best OCR Results

OCR accuracy is not magic — it depends almost entirely on the quality of the input. Here are the practical factors that make the biggest difference:

Scan at 300 DPI or higher. This is the single most impactful setting. Scanning at 150 DPI might look fine on screen, but the OCR engine will struggle with character edges. At 300 DPI, character boundaries are clean and recognition rates jump dramatically. For very small text (footnotes, fine print), 400-600 DPI is even better.
Use clean, flat originals. Wrinkled, folded, or stained paper introduces noise that confuses the OCR engine. If you are scanning a crumpled receipt, flatten it as much as possible before scanning. Place a white sheet of paper behind thin documents to prevent bleed-through from the scanner lid.
Select the correct language. OCR engines load language-specific models that include character sets, dictionaries, and linguistic rules. Running English OCR on a French document will produce systematic errors on accented characters. Running it on Chinese text will produce complete gibberish. Always match the language setting to the document's actual language.
Ensure good lighting for photos. If you are photographing a document with your phone instead of scanning it, lighting is everything. Use even, diffused lighting — avoid shadows across the text, avoid glare from glossy paper, and avoid harsh directional light that creates uneven contrast. Natural daylight near a window works well.
Keep the document aligned. Tilted or skewed pages reduce accuracy. Most OCR engines include auto-deskew, but it works best when the tilt is minor. If your scan is significantly rotated, straighten it first using an image editor or the PDF Rotate tool.
Use black text on white background. High contrast between text and background is what the OCR engine relies on. Colored text on colored backgrounds, watermarks behind text, or very light gray text on white paper all reduce accuracy significantly.
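The effect of the DPI setting in the first tip is easy to quantify. A typographic point is 1/72 of an inch, so the nominal height of a font in pixels is its point size divided by 72, multiplied by the scan resolution. The sketch below uses that relationship. Note that point size measures the full font height, so actual letter shapes are somewhat smaller than these figures, and the function names are just for illustration.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def text_height_px(point_size: float, dpi: int) -> float:
    """Nominal font height in pixels for a given point size and scan DPI."""
    return point_size / POINTS_PER_INCH * dpi


def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a scanned page of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 300, 600):
    print(dpi, round(text_height_px(10, dpi), 1))
# 150 20.8
# 300 41.7
# 600 83.3

print(page_pixels(8.5, 11, 300))  # (2550, 3300)
```

Doubling the DPI doubles the pixels available per character in each direction. Six-point fine print at 300 DPI gets only about 25 pixels of height, which is why 400 to 600 DPI helps for footnotes.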

OCR Limitations — What It Cannot Do Well

OCR has improved enormously over the past decade, but it still has clear limitations that you should understand before relying on it for critical tasks.

Handwriting recognition is poor. Modern OCR engines are trained primarily on printed and typed text. Neat, consistent handwriting in block letters might produce passable results, but cursive handwriting, messy handwriting, or mixed handwriting styles will produce mostly garbage. Specialized handwriting recognition (ICR — Intelligent Character Recognition) exists but is a different technology from standard OCR.
Complex layouts get jumbled. Multi-column text, text wrapped around images, sidebar callouts, tables with merged cells, and documents with complex visual hierarchies confuse OCR engines. The engine may read across columns instead of down them, merge table cells incorrectly, or interleave text from different sections. Simple, single-column documents produce the best results.
Decorative and unusual fonts fail. OCR models are trained on standard fonts — Times New Roman, Arial, Calibri, and their equivalents. Highly decorative fonts, artistic typography, heavily stylized logos, and ornamental text are often unrecognizable. If your document uses a novelty font, expect significant errors.
Very low resolution produces garbage. If the input image is below roughly 150 DPI, character shapes become too ambiguous for reliable recognition. A tiny, compressed JPEG of a document will produce output that is more wrong than right. There is no software fix for insufficient resolution — you need a better source image.
Mathematical formulas and special notation. Standard OCR cannot reliably interpret mathematical equations, chemical formulas, musical notation, or other specialized symbolic systems. These require purpose-built recognition systems.
Degraded or damaged documents. Faded ink, water damage, heavy creases through text, stamps overlapping text, and other physical damage reduce accuracy proportionally to severity. OCR can tolerate minor imperfections, but heavily damaged documents may need manual transcription.

Frequently Asked Questions

Is the OCR tool really free?

Yes, completely free. AllPDF.tools OCR runs entirely in your browser using open-source technology. There are no usage limits, no watermarks, no sign-up requirements, and no premium tier. You can OCR as many documents as you want.

What languages are supported?

The tool supports 12 languages: English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Chinese (Simplified), Japanese, Korean, and Arabic. Each language uses a dedicated trained model that is downloaded automatically when you select it.
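If you later move to the Tesseract engine directly, languages are selected by short codes that match the names of its trained-data files. The mapping below lists the standard Tesseract codes for the 12 languages above. Whether AllPDF.tools uses exactly these identifiers internally is an assumption, and the helper function is hypothetical.

```python
# Standard Tesseract language codes for the 12 languages listed above.
TESSERACT_LANG_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Portuguese": "por",
    "Italian": "ita",
    "Dutch": "nld",
    "Russian": "rus",
    "Chinese (Simplified)": "chi_sim",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
}


def lang_argument(*language_names: str) -> str:
    """Build a Tesseract language string; several languages are joined with '+'."""
    return "+".join(TESSERACT_LANG_CODES[name] for name in language_names)


print(lang_argument("English"))            # eng
print(lang_argument("English", "French"))  # eng+fra
```

The combined form is useful for documents that mix languages, such as an English contract with French clauses.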

Does OCR work on handwriting?

Poorly, in most cases. Standard OCR engines including the one used by AllPDF.tools are optimized for printed and typed text. Very neat, consistent block-letter handwriting may produce partially usable results, but cursive or messy handwriting will not be recognized accurately. For handwriting, you would need a specialized ICR (Intelligent Character Recognition) service.

How accurate is OCR?

On a clean, 300+ DPI scan of a printed document in a supported language, you can expect 95-99% character accuracy. That means on a typical page of 2,000 characters, you might see 20 to 100 errors, mostly in punctuation, special characters, or characters that look similar (like "l" and "1" or "O" and "0"). Lower-quality inputs produce proportionally lower accuracy. Always proofread OCR output before using it in any important context.
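The error counts above follow directly from the accuracy figure, and the same arithmetic shows why a small character error rate still means a noticeable share of damaged words. The word-level estimate below assumes errors are independent and words average five characters. Both are simplifying assumptions, so treat the result as a rough guide only.

```python
def expected_errors(total_chars: int, char_accuracy: float) -> int:
    """Expected number of wrong characters at a given character accuracy."""
    return round(total_chars * (1 - char_accuracy))


def word_accuracy(char_accuracy: float, avg_word_len: int = 5) -> float:
    """Chance that every character in a word is right (independence assumed)."""
    return char_accuracy ** avg_word_len


print(expected_errors(2000, 0.99))    # 20
print(expected_errors(2000, 0.95))    # 100
print(round(word_accuracy(0.99), 3))  # 0.951
print(round(word_accuracy(0.95), 3))  # 0.774
```

At 95% character accuracy, roughly one word in four could contain an error under these assumptions, which is why proofreading matters even for good scans.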

Can I OCR a document and then convert it to Word?

Yes. First, use the OCR tool to extract the text, then paste it into any word processor. Alternatively, if you want to maintain more formatting, use the Google Drive method described above — it produces a Google Doc that you can download as a DOCX file directly.

Does the OCR tool work offline?

Partially. After the first use, the OCR engine and language model are cached in your browser. If you have already loaded the tool and the required language data, it can work without an active internet connection. However, the initial load requires an internet connection to download the engine files (approximately 15 MB). For fully offline OCR, desktop software such as Adobe Acrobat or the Tesseract command-line tool is a better choice.
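For the Tesseract command-line route, a minimal sketch looks like the following. It assumes Tesseract is installed separately and that each page has already been exported as an image, since the Tesseract CLI reads image files rather than PDFs. The file names and helper functions are hypothetical. The command shape (input image, output base name, -l for language, and an optional trailing pdf to produce a searchable PDF) follows Tesseract's standard usage.

```python
import shutil
import subprocess


def build_tesseract_command(
    image_path: str,
    output_base: str,
    lang: str = "eng",
    searchable_pdf: bool = False,
) -> list[str]:
    """Build the argument list for one Tesseract CLI run.

    Tesseract appends the extension itself: output_base.txt by default,
    or output_base.pdf when the 'pdf' output config is requested.
    """
    command = ["tesseract", image_path, output_base, "-l", lang]
    if searchable_pdf:
        command.append("pdf")
    return command


def run_ocr(image_path: str, output_base: str, lang: str = "eng") -> None:
    """Run Tesseract on one image; raises if it is missing or fails."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)


# Hypothetical file names, shown without executing anything.
print(build_tesseract_command("page-1.png", "page-1"))
# ['tesseract', 'page-1.png', 'page-1', '-l', 'eng']
```

Calling run_ocr("page-1.png", "page-1") would write page-1.txt next to the image, entirely offline once the language data is installed.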

What file formats can I OCR?

The AllPDF.tools OCR tool accepts scanned PDFs as well as image files in JPG, PNG, BMP, WebP, and TIFF formats. If your document is in a different format, convert it to PDF or an image first, then run OCR on it.

Why does the first OCR take longer?

On the very first run, the tool downloads the Tesseract.js OCR engine and the trained language data file to your browser. This is approximately 15 MB depending on the language selected. Once downloaded, these files are cached by your browser, so subsequent OCR operations on the same language start almost instantly without any additional download.