OCR a scanned research archive locally — without uploading IRB-restricted, embargoed, or unpublished material
Point FileHop at a folder of image-only PDFs, TIFFs, or scanned pages. OCR runs locally using MiniCPM-V, Chandra OCR, or olmOCR-2; extracted text lands next to each scan, indexed by Spotlight or Windows Search. Models are downloaded on first use (5.7–11GB from HuggingFace) and are not bundled with the FileHop install.
Why local OCR matters for researchers
You inherit a folder of 300 PDFs from a former lab member. Half are image-only scans — old papers, photocopied protocols, archival fieldwork, handwritten lab notebooks. You can open them, but Spotlight and Windows Search return nothing when you look for ‘hippocampus’ or ‘p < 0.05’ or a participant pseudonym. Until those pages get a text representation, the archive is opaque.
The obvious fix is to drag them into a free online OCR service. For researchers, that fix fails in a category of cases that other guides on this topic rarely address: IRB-restricted protocols, embargoed manuscripts, unpublished data, sensitive interview transcripts, material covered by a data-use agreement, and any scan whose third-party-server exposure your institution would treat as a documented compliance risk. Uploading those files to a SaaS OCR vendor is exactly the trade-off readers of this guide are trying to avoid.
The gap in existing options. Cloud OCR (Adobe online, Smallpdf, the SaaS handwriting-OCR vendors) requires upload. Tesseract is open-source and runs locally, but it predates vision-language models and is weak on multi-column scientific layouts, dense tables, math, and handwriting. ABBYY FineReader and Acrobat Pro also process locally on the desktop, but they are paid commercial tools without a modern VLM backend. The lane this guide fills: a desktop app that runs a modern Vision-Language Model on your own machine, with an honest model-choice walkthrough across the content types researchers actually have.
Note on scope: Local processing is not a substitute for IRB approval, a data-use agreement, or your institution's data-classification policy. It reduces a specific risk category — third-party-server exposure during the OCR step. The broader determination belongs to your IRB, your institution, and you.
Pick the OCR model that fits your RAM and your content
FileHop auto-recommends a model based on your system RAM. You can override that choice if your archive has specific content needs (handwriting, non-Latin scripts, English-only academic papers). The three models below are the default variants (Q4_K_M for each family) that fit on most researcher laptops. Higher-precision quantizations (Q5_K_M, Q8_0) exist for machines with 14–16GB+ RAM that want maximum accuracy on archival material.
MiniCPM-V 2.6 Q4_K_M
- RAM: 8GB+ (tight on 8GB; close other apps)
- Download: ~5.7GB (model + multimodal projector)
- Languages: 30+ languages (Latin, CJK, Arabic, Hindi, Thai, Cyrillic, Greek, Hebrew)
- Capabilities: Handwriting · Tables · Math
Best for: General-purpose multi-language OCR. Pick this if you don't know which to pick — the RAM picker recommends it by default.
Not for: Maximum accuracy on dense multi-column scientific layouts — olmOCR-2 is stronger there if your content is English-only.
Chandra OCR Q4_K_M
- RAM: 12GB+
- Download: ~7.3GB (model + F32 multimodal projector)
- Languages: 40+ languages including non-Latin scripts (Bengali, Tamil, Telugu, Urdu, Burmese, Khmer, Lao, Georgian, Amharic, Swahili, Yoruba, Igbo, Zulu, Xhosa)
- Capabilities: Handwriting · Tables · Math · Direct Typst output
Best for: Multi-language archival material, especially non-Latin scripts. Strongest handwriting + complex tables + math support of the three. Direct Typst output for math-heavy academic content.
Not for: 8GB-RAM machines — use MiniCPM-V instead. Chandra Q4 needs 12GB minimum.
olmOCR-2 Q4_K_M
- RAM: 10GB+
- Download: ~6GB
- Languages: English only
- Capabilities: Tables · Math (no handwriting)
Best for: Academic-paper layouts in English. 82.4 score on olmOCR-Bench (Allen AI, October 2025; RLVR-trained on Qwen2.5-VL-7B-Instruct) — the strongest published benchmark of the three on multi-column scientific layouts, math formula conversion, and table parsing. Beats Marker (76.1) and MinerU (75.8).
Not for: Handwriting (not supported) or any non-English content.
Higher-RAM variants for archival or critical documents
If your machine has 14–16GB+ RAM and you're working on archival or critical documents where every character matters, the Q5_K_M and Q8_0 variants of Chandra (8.16GB / 11.02GB) and olmOCR-2 (6.79GB / 9.45GB) offer higher accuracy on the same architecture. They are the same model family at a higher-precision quantization (more bits per weight), so the files are larger and each page is slower to process. FileHop exposes them in the model picker but recommends Q4_K_M as the default for each family.
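Before choosing a larger variant, it can help to check that the download fits on disk. The following is a minimal sketch, not part of FileHop: the sizes are the ones quoted in this guide, while the function names, the 2GB headroom, and the decimal-gigabyte conversion (1GB = 1e9 bytes) are assumptions made for illustration.

```python
# Download sizes in GB as quoted in this guide.
DOWNLOAD_GB = {
    ("MiniCPM-V 2.6", "Q4_K_M"): 5.72,
    ("Chandra OCR", "Q4_K_M"): 7.34,
    ("Chandra OCR", "Q5_K_M"): 8.16,
    ("Chandra OCR", "Q8_0"): 11.02,
    ("olmOCR-2", "Q4_K_M"): 6.03,
    ("olmOCR-2", "Q5_K_M"): 6.79,
    ("olmOCR-2", "Q8_0"): 9.45,
}


def total_download_gb(variants):
    """Sum the quoted download sizes for a list of (family, quantization) pairs."""
    return round(sum(DOWNLOAD_GB[v] for v in variants), 2)


def fits_on_disk(variants, free_bytes, headroom_gb=2.0):
    """Return True if the downloads plus some headroom fit in free_bytes.

    headroom_gb is an arbitrary safety margin (assumption, not a FileHop value).
    """
    needed_bytes = (total_download_gb(variants) + headroom_gb) * 1e9
    return free_bytes >= needed_bytes
```

To check a real machine, pass `shutil.disk_usage(folder).free` from the standard library as `free_bytes`, where `folder` is wherever the models will be stored.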
If your machine has less than 8GB RAM
No local VLM will fit. FileHop exposes an opt-in cloud OCR fallback (OpenAI or Gemini, your API key) for this case — but this is not the recommended path for IRB-restricted, embargoed, or unpublished material. It is the explicit trade-off route when you knowingly choose cloud accuracy over local privacy. The cloud fallback is off by default and asks for explicit consent per task.
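The RAM and content guidance in this section can be summarized as a small decision function. This is a sketch of the guide's recommendations, not FileHop's actual picker logic; the function name and parameters are hypothetical.

```python
from typing import Optional


def recommend_model(ram_gb: float,
                    english_only: bool = False,
                    handwriting: bool = False) -> Optional[str]:
    """Return the Q4_K_M variant this guide suggests, or None if no local model fits."""
    if ram_gb < 8:
        # Below 8GB no local VLM fits; cloud OCR is an opt-in fallback only.
        return None
    if english_only and not handwriting and ram_gb >= 10:
        # olmOCR-2 is English-only and does not support handwriting.
        return "olmOCR-2 Q4_K_M"
    if ram_gb >= 12:
        # Chandra needs 12GB minimum and covers handwriting and non-Latin scripts.
        return "Chandra OCR Q4_K_M"
    # General-purpose choice for 8GB up to (but not including) 12GB.
    return "MiniCPM-V 2.6 Q4_K_M"
```

For example, a 16GB laptop with English-only printed papers maps to olmOCR-2, while the same laptop with handwritten notebooks maps to Chandra OCR.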
End-to-end workflow on a folder of image-only PDFs
This is the core operational section of the guide. The workflow was tested against archives of 50–300 PDFs.
- 1
Open the archive folder in FileHop
Drag the folder of image-only PDFs (or TIFFs, JPGs, PNGs of scanned pages) into FileHop. The app sees the whole folder — no individual upload step, no Docker, no LM Studio prompt template. If you have multi-page TIFFs from a flatbed scanner, those work directly too.
- 2
Pick the OCR model that fits your RAM and content
FileHop auto-recommends a model based on your system RAM (8GB → MiniCPM-V; 12GB+ → Chandra OCR). Override if your content is English-only academic papers (choose olmOCR-2) or multi-language with handwriting (choose Chandra). First selection prompts a one-time HuggingFace download — 5.7GB for MiniCPM-V, 7.3GB for Chandra Q4, 6GB for olmOCR-2 Q4. Subsequent runs load from the local cache. Downloads are resumable, cancelable, and pauseable.
- 3
Pick the output format
Available formats: Plain Text, Markdown, JSON, HTML, or Typst (Chandra direct; other models convert through a two-pass Markdown→Typst step). For full-text search across a research archive, Markdown is the recommended default — it preserves paragraph and heading structure, renders in your editor of choice, and is indexed by Spotlight (Mac) and Windows Search alongside the source PDF. JSON is useful when you intend to feed the output into a downstream pipeline (Pandas, a Zotero plugin, custom analysis).
- 4
Run the batch
FileHop walks the folder page-by-page. Per-page progress events fire as each page is converted to an image, preprocessed (resized to fit the model's vision encoder), encoded, and sent to the local VLM server. Output lands as a sibling file next to each source PDF (e.g., paper_001.pdf → paper_001.md in the same folder). Nothing leaves the machine. On CPU-only machines a 300-page archive can take hours; let it run overnight if needed — per-page progress is visible the whole time.
- 5
Required habit — not optional
Verify a representative sample
VLM OCR is probabilistic. Open three random extracted Markdown files alongside their source scans. Copy a sentence from the extracted text and look for it on the source page: does it match? Open one with a table and check that columns align. Open one with math (if applicable) and confirm the equations parsed reasonably. This 60-second spot-check is the one habit this guide treats as required. If accuracy looks poor on a particular file, try Chandra OCR Q5_K_M or Q8_0 (same model family, higher-precision quantization) on that one file, or fall back to OCRmyPDF + Tesseract for printed, Latin-script-only content, where Tesseract is competitive.
- 6
Search the archive
Spotlight (Mac) or Windows Search will now index the extracted .md or .txt files alongside their original .pdf scans. A search for ‘hippocampus’ or ‘p < 0.05’ or a participant pseudonym surfaces the matching markdown file; you click through to the original scan from there. The extracted text and the source scan are co-located in the same folder — both stay on disk, neither has been uploaded.
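After a batch finishes, one way to confirm that every scan received an output is to look for scans with no sibling text file. The sketch below assumes the same-stem naming shown in step 4 (paper_001.pdf → paper_001.md) applies to every input type, and the extension lists are guesses (for example, .typ for Typst); adjust them to match what you see on disk.

```python
from pathlib import Path

# Input types named in this guide; extensions are assumptions.
SCAN_SUFFIXES = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}
# Possible sibling outputs; the exact extensions FileHop writes are assumed here.
TEXT_SUFFIXES = (".md", ".txt", ".json", ".html", ".typ")


def missing_outputs(folder):
    """Return scans under folder that have no same-stem text file next to them."""
    missing = []
    for scan in sorted(Path(folder).rglob("*")):
        if not scan.is_file() or scan.suffix.lower() not in SCAN_SUFFIXES:
            continue
        has_sibling = any(scan.with_suffix(s).exists() for s in TEXT_SUFFIXES)
        if not has_sibling:
            missing.append(scan)
    return missing
```

Calling `missing_outputs("path/to/archive")` returns a list of paths; an empty list suggests every scan has at least one extracted file beside it, though it says nothing about the quality of that text.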
A note on what FileHop does NOT produce
FileHop extracts recognized text to a separate Markdown / TXT / JSON / HTML / Typst file next to the source scan. It does NOT write a hidden text layer over the original PDF, which is the format that Adobe Acrobat and ABBYY FineReader call a ‘searchable PDF’. If your downstream workflow requires a single searchable-PDF artifact (e.g., for sharing with a co-author who does not have your folder structure), pass the source PDF through OCRmyPDF after this step: it is open-source, runs locally, and adds a text layer to image-only PDFs. This guide does not claim searchable-PDF output for FileHop because that feature has not shipped.
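If you need that searchable-PDF artifact, OCRmyPDF can be driven from a short script. This is a minimal sketch assuming the ocrmypdf command is installed and on your PATH; the _searchable output suffix and the helper names are illustrative choices. Note that the text layer comes from OCRmyPDF's Tesseract backend, not from the VLM output.

```python
import shutil
import subprocess
from pathlib import Path


def ocrmypdf_command(src, language="eng"):
    """Build the command line: ocrmypdf -l <language> <input.pdf> <output.pdf>."""
    src = Path(src)
    # Output name is an illustrative convention, written next to the source.
    dst = src.with_name(src.stem + "_searchable.pdf")
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def make_searchable(src, language="eng"):
    """Run OCRmyPDF locally and return the path of the searchable PDF."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    cmd = ocrmypdf_command(src, language)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])
```

The source scan is left untouched; the searchable copy is a separate file you can share on its own.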
OCRmyPDF: open-source local CLI for searchable-PDF output
Honest scope: what local VLM OCR is and is not
This is the habit-teaching section, not a marketing disclaimer. Read it before you trust extracted text for anything that matters.
- VLM OCR is probabilistic. Models can hallucinate characters (especially on smudged or low-contrast scans), misread similar glyphs (rn ↔ m, cl ↔ d, degraded handwriting strokes), drop tabular structure, transcribe an equation as the closest-looking ASCII approximation rather than as a true LaTeX or Typst expression, or skip a page entirely if image preprocessing fails. The copy-paste / spot-check verification habit in Step 5 is the response.
- Multi-column scientific layouts, dense tables, equations, and handwriting are where modern VLMs (olmOCR-2, Chandra OCR) substantially beat Tesseract; these are the document types where Tesseract was historically weak. ‘Substantially better than Tesseract on hard content’ is not the same as ‘verified accurate.’ For evidence-grade material where every character matters (court exhibits, regulatory submissions), specialist commercial OCR (ABBYY FineReader, Adobe Acrobat Pro) remains the route. The sibling guide for lawyers documents that routing explicitly.
- Models are not bundled with the FileHop install. First use downloads 5.7–11GB from HuggingFace (MiniCPM-V 5.72GB; Chandra Q4 7.34GB; Chandra Q5 8.16GB; Chandra Q8 11.02GB; olmOCR-2 Q4 6.03GB; olmOCR-2 Q5 6.79GB; olmOCR-2 Q8 9.45GB). Downloads are resumable, cancelable, and pauseable. Once a model is on disk it never needs to be downloaded again. HuggingFace URLs are visible to the user before the download starts.
- FileHop is Mac + Windows only. There is no Linux build today. On Linux, OCRmyPDF (Tesseract-backed) and the Chandra / olmOCR-2 reference implementations on the model authors' GitHub repos remain available; the FAQ documents the Linux alternatives.
- Local OCR does not certify IRB compliance, HIPAA compliance, GDPR processor status, FERPA obligations, or any institutional data-classification policy. It reduces the specific risk of third-party-server exposure during the OCR step. The broader determination belongs to your institution and you.
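The spot-check habit from Step 5 is easier to keep if the sample is drawn for you. The sketch below picks a few random (source scan, extracted Markdown) pairs to open side by side. It assumes PDF sources with same-stem .md siblings; the function name and the default of three files are illustrative, and the comparison itself is still done by eye.

```python
import random
from pathlib import Path


def spot_check_sample(folder, k=3, seed=None):
    """Return up to k random (source_pdf, extracted_md) pairs found under folder."""
    pairs = []
    for md in sorted(Path(folder).rglob("*.md")):
        source = md.with_suffix(".pdf")
        if source.exists():  # skip Markdown files that have no scan beside them
            pairs.append((source, md))
    rng = random.Random(seed)  # pass a seed to make the sample repeatable
    return rng.sample(pairs, min(k, len(pairs)))
```

Leaving `seed` unset gives a fresh sample on every run, which is usually what you want for repeated checks on a growing archive.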
Why this guide routes differently from the sibling guide for lawyers
Both audiences share the same underlying privacy concern — uploading sensitive material to a third-party OCR service is the trade-off they are trying to avoid. The split is on the accuracy ceiling. The two articles are reciprocally cross-linked because the routing is genuinely different.
Researcher (this guide) — lead with local VLM
- Material is often IRB-restricted, embargoed, or unpublished. Uploading is a documented compliance risk.
- Tolerable accuracy is high but not evidence-grade: the researcher verifies with a copy-paste spot-check, and the source scans remain co-located in the same folder.
- Languages, scripts, handwriting, math, and multi-column layouts are common, which is exactly the content where modern VLMs (Chandra OCR, olmOCR-2) substantially beat Tesseract.
- Recommendation: run OCR locally with the model that fits your RAM and content type. This guide covers that path.
Lawyer (sibling guide) — route to specialist tools
- Material is often evidence: court exhibits, deposition transcripts, signed contracts.
- Required accuracy is evidence-grade. Local imperfection is a malpractice risk.
- ABBYY FineReader and Adobe Acrobat Pro are paid specialist tools with decades of legal-document OCR optimization and audit trails. They process locally on the desktop.
- Recommendation: route OCR to a specialist tool. See the sibling guide.
Shared concern, different ceiling: Both audiences share the underlying privacy concern. The split is on the accuracy ceiling. If you're a lawyer who landed here, follow the sibling routing. If you're a researcher who landed on the lawyer guide, follow this one.
Troubleshooting — common failure modes and what to try
Practical fixes for the recurring problems that show up across archive content.
| Problem | What to try |
|---|---|
| Image-only PDF — nothing happens on copy/paste or search | Confirm the source PDF is image-only (pdftotext of the source returns nothing). Run the workflow on the file; check the sibling .md output sitting next to the PDF in the same folder. |
| Multi-column scientific paper — columns interleave in the extracted markdown | olmOCR-2 was trained specifically for multi-column scientific layouts. If your content is English and you have 10GB+ RAM, switch to olmOCR-2 Q4_K_M. |
| Handwritten lab notebook page — extracted text is garbled | olmOCR-2 does NOT support handwriting. Switch to Chandra OCR Q4_K_M (12GB+ RAM) or MiniCPM-V Q4 (8GB+). For very degraded handwriting, escalate to Chandra Q5 or Q8 (14–16GB RAM). |
| Non-Latin script (Devanagari, Arabic, Tamil, Bengali, Burmese, etc.) — not recognized | Switch to Chandra OCR — it supports 40+ languages including the non-Latin scripts listed. MiniCPM-V supports a narrower set. |
| Math equations — transcribed as ASCII approximation, not LaTeX or Typst | Chandra OCR can output direct Typst (supports_direct_typst is set in the model registry). Pick Typst as the output format with Chandra selected. Other models produce Markdown, which FileHop then converts to Typst via a two-pass step built into the OCR service. |
| TIFF input (single-page or multi-page) — want searchable text | FileHop accepts TIFF directly. Use the same workflow as for PDFs — no separate conversion step. |
| Whole archive is 500+ PDFs — batch is slow | VLM OCR runs page-by-page on CPU (or GPU if FileHop detects acceleration). 500 single-page papers will take hours on CPU-only. Let it run overnight; per-page progress events are visible the whole time. If absolute speed matters more than local privacy, the cloud OCR fallback (OpenAI or Gemini) is faster but requires upload. |
| Linux | FileHop does not ship a Linux build today. OCRmyPDF (Tesseract backend, 100+ languages) and the Chandra OCR + olmOCR-2 reference implementations on the model authors' GitHub repos remain available on Linux. |
| Want a single searchable-PDF artifact (not a sibling markdown file) | FileHop does not produce searchable-PDF output. Pass the source PDF through OCRmyPDF (open-source, local, CLI) for that artifact format. |
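The first row of the table suggests confirming that a PDF is image-only by checking that pdftotext returns nothing. A minimal sketch of that check follows, assuming Poppler's pdftotext is installed and on your PATH. The 20-character threshold is an arbitrary allowance for stray characters, and the function names are illustrative.

```python
import shutil
import subprocess


def looks_image_only(extracted_text, min_chars=20):
    """Treat a PDF as image-only if almost no non-whitespace text was extracted."""
    visible = "".join(extracted_text.split())  # drops spaces, newlines, form feeds
    return len(visible) < min_chars


def pdf_is_image_only(pdf_path, min_chars=20):
    """Run `pdftotext <file> -` (text to stdout) and classify the result."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) is not installed or not on PATH")
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return looks_image_only(result.stdout, min_chars)
```

Files for which `pdf_is_image_only` returns False already carry a text layer and may not need OCR at all.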
Trust panel — how the OCR pipeline handles your files
Plain mechanics, no marketing language. Read this and the honest-scope block together.
- Processing is local on Mac and Windows. The OCR service starts a local server on your machine, sends each PDF page (preprocessed as an image) to the model running on the same machine, and writes the output to a sibling file in the same folder.
- Models live on disk after a one-time HuggingFace download (5.7–11GB depending on which model fits your RAM). The HuggingFace URLs are visible to you before the download starts; downloads are resumable, cancelable, and pauseable.
- Cloud OCR exists as an opt-in fallback (OpenAI or Gemini, with your API key) for machines that do not meet the local-model RAM requirement. It is off by default and asks for explicit consent per task. It is NOT the recommended path for IRB-restricted, embargoed, or unpublished material.
- FileHop is not certified for any IRB, HIPAA, GDPR, or FERPA protocol. Those certifications belong to your institution. Local OCR reduces the specific risk category of third-party-server exposure during the OCR step; the broader determination is yours.
Frequently asked questions
Which OCR model should I pick if I don't know what my content is?
How much disk space do the models take?
Does FileHop upload anything during OCR?
Can FileHop produce a single searchable-PDF artifact?
How accurate is VLM-based OCR vs. Tesseract on a scientific paper?
What about Chandra OCR's 90-language version?
Does the OCR step satisfy my IRB protocol?
What if I'm on Linux?
Can I OCR a handwritten lab notebook?
What if the cloud OCR fallback gives better accuracy on a specific page?
How do I cross-reference an extracted markdown file back to a specific page of the source PDF?
I'm a lawyer reading this: should I use local VLM OCR for evidence?
Sources and citations
Key references for the model claims and the routing guidance.
- Allen AI: olmOCR 2 release blog (82.4 olmOCR-Bench, RLVR on Qwen2.5-VL-7B-Instruct, October 2025)
- datalab-to/chandra: GitHub repository (40+ languages, handwriting, tables, math, layout)
- noctrex/Chandra-OCR-GGUF: HuggingFace model card (Q4_K_M, Q5_K_M, Q8_0 variants)
- lmstudio-community/MiniCPM-V-2_6-GGUF: HuggingFace model card
- bartowski/allenai_olmOCR-2-7B-1025-GGUF: HuggingFace model card
- ahnafnafee/local-llm-pdf-ocr: GitHub (open-source local OCR precedent)
- OCRmyPDF: GitHub (open-source local searchable-PDF tool, Linux/Mac/Windows)
- Polyglossic: OCR for Historical Newsprint: Four Models Worth Running Locally in LM Studio (example of an honest model-by-model comparison)
- HHS Office for Civil Rights: HIPAA Privacy Rule guidance (IRB-restricted material framing)
Download FileHop and OCR your archive locally
Models are downloaded on first use from HuggingFace (5.7–11GB depending on which model fits your RAM). They are not bundled with the install. The OCR runs on your machine; your archive does not leave it. Mac and Windows.