OCR a scanned research archive locally — without uploading IRB-restricted, embargoed, or unpublished material

Point FileHop at a folder of image-only PDFs, TIFFs, or scanned pages. OCR runs locally using MiniCPM-V, Chandra OCR, or olmOCR-2; extracted text lands next to each scan, indexed by Spotlight or Windows Search. Models are downloaded on first use (5.7–11GB from HuggingFace) and are not bundled with the FileHop install.

Why local OCR matters for researchers

You inherit a folder of 300 PDFs from a former lab member. Half are image-only scans — old papers, photocopied protocols, archival fieldwork, handwritten lab notebooks. You can open them, but Spotlight and Windows Search return nothing when you look for ‘hippocampus’ or ‘p < 0.05’ or a participant pseudonym. Until those pages get a text representation, the archive is opaque.

The obvious fix is to drag them into a free online OCR service. That fix fails for a category of research material that other search results on this topic rarely acknowledge: IRB-restricted protocols, embargoed manuscripts, unpublished data, sensitive interview transcripts, material covered by a data-use agreement, and any scan whose exposure to a third-party server your institution would treat as a documented compliance risk. Uploading those files to a SaaS OCR vendor is exactly the trade-off that readers of this guide are trying to avoid.

The gap in existing tools. Cloud OCR (Adobe online, Smallpdf, the SaaS handwriting-OCR vendors) requires upload. Tesseract is open-source and runs locally, but it predates Vision-Language Models and is weak on multi-column scientific layouts, dense tables, math, and handwriting. ABBYY FineReader and Acrobat Pro also process locally on the desktop, but they are paid commercial tools without a modern VLM backend. The lane this guide fills: a desktop app that runs a modern Vision-Language Model on your own machine, with an honest model-choice walkthrough across the content types researchers actually have.

Note on scope: Local processing is not a substitute for IRB approval, a data-use agreement, or your institution's data-classification policy. It reduces a specific risk category — third-party-server exposure during the OCR step. The broader determination belongs to your IRB, your institution, and you.

Pick the OCR model that fits your RAM and your content

FileHop auto-recommends a model based on your system RAM. You can override that choice if your archive has specific content needs (handwriting, non-Latin scripts, English-only academic papers). The three defaults below are the Q4_K_M variant of each model family, the size that fits on most researcher laptops. Higher-precision quantizations (Q5_K_M and Q8_0) exist for 14–16GB+ machines that want maximum accuracy on archival material.

MiniCPM-V 2.6 Q4_K_M (Default)

RAM: 8GB+ (tight on 8GB; close other apps)
Download: ~5.7GB (model + multimodal projector)
Languages: 30+, covering Latin, CJK, Arabic, Hindi, Thai, Cyrillic, Greek, Hebrew
Capabilities: Handwriting · Tables · Math

Best for: General-purpose multi-language OCR. Pick this if you don't know which to pick — the RAM picker recommends it by default.

Not for: Maximum accuracy on dense multi-column scientific layouts — olmOCR-2 is stronger there if your content is English-only.

lmstudio-community/MiniCPM-V-2_6-GGUF on HuggingFace →

Chandra OCR Q4_K_M (Multi-language + handwriting)

RAM: 12GB+
Download: ~7.3GB (model + F32 multimodal projector)
Languages: 40+, including non-Latin scripts such as Bengali, Tamil, Telugu, Urdu, Burmese, Khmer, Lao, Georgian, Amharic, Swahili, Yoruba, Igbo, Zulu, Xhosa
Capabilities: Handwriting · Tables · Math · Direct Typst output

Best for: Multi-language archival material, especially non-Latin scripts. Strongest handwriting + complex tables + math support of the three. Direct Typst output for math-heavy academic content.

Not for: 8GB-RAM machines — use MiniCPM-V instead. Chandra Q4 needs 12GB minimum.

noctrex/Chandra-OCR-GGUF on HuggingFace →

olmOCR-2 Q4_K_M (82.4 on olmOCR-Bench, Allen AI)

RAM: 10GB+
Download: ~6GB
Languages: English only
Capabilities: Tables · Math (no handwriting)

Best for: Academic-paper layouts in English. 82.4 score on olmOCR-Bench (Allen AI, October 2025; RLVR-trained on Qwen2.5-VL-7B-Instruct) — the strongest published benchmark of the three on multi-column scientific layouts, math formula conversion, and table parsing. Beats Marker (76.1) and MinerU (75.8).

Not for: Handwriting (not supported) or any non-English content.

bartowski/allenai_olmOCR-2-7B-1025-GGUF on HuggingFace →

Higher-RAM variants for archival or critical documents

If your machine has 14–16GB+ RAM and you're working on archival or critical documents where every character matters, the Q5_K_M and Q8_0 variants of Chandra (8.16GB / 11.02GB) and olmOCR-2 (6.79GB / 9.45GB) offer higher accuracy on the same architecture. They are the same model family at a higher-precision quantization: larger files, more RAM, slower per page. FileHop exposes them in the model picker but recommends Q4_K_M as the default for each family.

If your machine has less than 8GB RAM

No local VLM will fit. FileHop exposes an opt-in cloud OCR fallback (OpenAI or Gemini, your API key) for this case — but this is not the recommended path for IRB-restricted, embargoed, or unpublished material. It is the explicit trade-off route when you knowingly choose cloud accuracy over local privacy. The cloud fallback is off by default and asks for explicit consent per task.
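
The RAM thresholds in this section can be written down as a small decision rule. The following is a minimal sketch of that rule, not FileHop's actual picker: the function names `recommend_model` and `can_run` are hypothetical, and the only facts taken from this guide are the minimum-RAM figures (8GB, 10GB, 12GB) and the default recommendation (MiniCPM-V at 8GB, Chandra OCR at 12GB+).

```python
from typing import Optional

# Minimum RAM per default variant, as stated in this guide.
MIN_RAM_GB = {
    "MiniCPM-V 2.6 Q4_K_M": 8,
    "olmOCR-2 Q4_K_M": 10,
    "Chandra OCR Q4_K_M": 12,
}


def recommend_model(ram_gb: float) -> Optional[str]:
    """Illustrative default: None means no local model fits this machine."""
    if ram_gb < 8:
        return None  # only the opt-in cloud fallback remains
    if ram_gb < 12:
        return "MiniCPM-V 2.6 Q4_K_M"
    return "Chandra OCR Q4_K_M"


def can_run(model: str, ram_gb: float) -> bool:
    """Check a manual override (e.g. olmOCR-2) against its RAM floor."""
    return ram_gb >= MIN_RAM_GB[model]


if __name__ == "__main__":
    for ram in (6, 8, 10, 16):
        print(ram, "GB ->", recommend_model(ram))
```

A manual override is then a second check: on a 10GB machine, `can_run("olmOCR-2 Q4_K_M", 10)` passes while `can_run("Chandra OCR Q4_K_M", 10)` does not.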

End-to-end workflow on a folder of image-only PDFs

This is the core operational section of the guide. It was tested against archives of 50–300 PDFs.

  1. Open the archive folder in FileHop

    Drag the folder of image-only PDFs (or TIFFs, JPGs, PNGs of scanned pages) into FileHop. The app sees the whole folder — no individual upload step, no Docker, no LM Studio prompt template. If you have multi-page TIFFs from a flatbed scanner, those work directly too.

  2. Pick the OCR model that fits your RAM and content

    FileHop auto-recommends a model based on your system RAM (8GB → MiniCPM-V; 12GB+ → Chandra OCR). Override if your content is English-only academic papers (choose olmOCR-2) or multi-language with handwriting (choose Chandra). First selection prompts a one-time HuggingFace download — 5.7GB for MiniCPM-V, 7.3GB for Chandra Q4, 6GB for olmOCR-2 Q4. Subsequent runs load from the local cache. Downloads are resumable, cancelable, and pauseable.

  3. Pick the output format

    Available formats: Plain Text, Markdown, JSON, HTML, or Typst (Chandra direct; other models convert through a two-pass Markdown→Typst step). For full-text search across a research archive, Markdown is the recommended default — it preserves paragraph and heading structure, renders in your editor of choice, and is indexed by Spotlight (Mac) and Windows Search alongside the source PDF. JSON is useful when you intend to feed the output into a downstream pipeline (Pandas, a Zotero plugin, custom analysis).

  4. Run the batch

    FileHop walks the folder page-by-page. Per-page progress events fire as each page is converted to an image, preprocessed (resized to fit the model's vision encoder), encoded, and sent to the local VLM server. Output lands as a sibling file next to each source PDF (e.g., paper_001.pdf → paper_001.md in the same folder). Nothing leaves the machine. On CPU-only machines a 300-page archive can take hours; let it run overnight if needed — per-page progress is visible the whole time.

  5. Verify a representative sample (required habit, not optional)

    VLM OCR is probabilistic. Open three random extracted markdown files alongside their source scans. Copy a sentence from the extracted text and look for it on the source page: does it match? Open one with a table and check that columns align. Open one with math (if applicable) and confirm the equations parsed reasonably. This 60-second spot-check is the required habit of this workflow. If accuracy looks poor on a particular file, try Chandra OCR Q5_K_M or Q8_0 (the same model at a higher-precision quantization) on that one file, or fall back to OCRmyPDF + Tesseract for printed, Latin-script content where Tesseract is competitive.

  6. Search the archive

    Spotlight (Mac) or Windows Search will now index the extracted .md or .txt files alongside their original .pdf scans. A search for ‘hippocampus’ or ‘p < 0.05’ or a participant pseudonym surfaces the matching markdown file; you click through to the original scan from there. The extracted text and the source scan are co-located in the same folder — both stay on disk, neither has been uploaded.
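
Two parts of the workflow above are easy to script outside FileHop: finding which PDFs already have a sibling output file, and drawing the three random files for the spot-check in step 5. The sketch below assumes only the naming convention stated in step 4 (paper_001.pdf → paper_001.md in the same folder); the function names are hypothetical and nothing here calls FileHop itself.

```python
import random
from pathlib import Path
from typing import List, Optional, Tuple

Pair = Tuple[Path, Path]  # (source scan, extracted text file)


def sibling_output(pdf: Path, ext: str = ".md") -> Path:
    """paper_001.pdf -> paper_001.md, in the same folder."""
    return pdf.with_suffix(ext)


def split_by_output(folder: Path, ext: str = ".md") -> Tuple[List[Pair], List[Pair]]:
    """Return (done, missing): PDFs with and without a sibling output file."""
    done: List[Pair] = []
    missing: List[Pair] = []
    for pdf in sorted(folder.glob("*.pdf")):
        out = sibling_output(pdf, ext)
        (done if out.exists() else missing).append((pdf, out))
    return done, missing


def spot_check_sample(done: List[Pair], k: int = 3,
                      seed: Optional[int] = None) -> List[Pair]:
    """Pick up to k random (scan, text) pairs to open side by side."""
    rng = random.Random(seed)
    return rng.sample(done, min(k, len(done)))


if __name__ == "__main__":
    done, missing = split_by_output(Path("."))
    print(f"{len(done)} converted, {len(missing)} still missing output")
    for pdf, text in spot_check_sample(done):
        print("check:", pdf.name, "against", text.name)
```

Sampling at random rather than opening the first three files matters: the first files in a folder are often the cleanest scans, which makes the archive look better than it is.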

A note on what FileHop does NOT produce

FileHop extracts recognized text to a separate Markdown / TXT / JSON / HTML / Typst file next to the source scan. It does NOT write a hidden text layer over the original PDF — the format that Adobe Acrobat and ABBYY FineReader call a ‘searchable PDF’. If your downstream workflow requires a single searchable-PDF artifact (e.g., for sharing with a co-author who does not have your folder structure), pass the source PDF through OCRmyPDF after this step — it is open-source, runs locally, and adds a text layer to image-only PDFs. We considered claiming searchable-PDF output here; we cannot honestly until the feature ships.

OCRmyPDF — open-source local CLI for searchable-PDF output →
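
If you do need the searchable-PDF artifact, the OCRmyPDF step can be scripted per file. This is a minimal sketch, assuming OCRmyPDF is installed and on your PATH; the `_searchable` output naming and the function names are this example's own choices, not something FileHop or OCRmyPDF prescribes. The `--language` and `--skip-text` options are standard OCRmyPDF flags (`--skip-text` leaves pages that already contain text untouched).

```python
import subprocess
from pathlib import Path
from typing import List


def searchable_name(src: Path) -> Path:
    """Hypothetical naming: paper_001.pdf -> paper_001_searchable.pdf."""
    return src.with_name(src.stem + "_searchable.pdf")


def ocrmypdf_command(src: Path, language: str = "eng") -> List[str]:
    """Build the argv for one file; nothing is executed here."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language code
        "--skip-text",           # do not re-OCR pages that already have text
        str(src),
        str(searchable_name(src)),
    ]


def add_text_layer(src: Path, language: str = "eng") -> None:
    """Run OCRmyPDF locally; raises CalledProcessError on failure."""
    subprocess.run(ocrmypdf_command(src, language), check=True)
```

Writing to a new file rather than overwriting the source keeps the original scan intact, which matters if the scan itself is the archival record.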

Honest scope — what local VLM OCR is and is not

This is the habit-teaching section, not a marketing disclaimer. Read it before you trust extracted text for anything that matters.

  • VLM OCR is probabilistic. Models can hallucinate characters (especially on smudged or low-contrast scans), misread similar glyphs (rn ↔ m, cl ↔ d, degraded handwriting strokes), drop tabular structure, transcribe an equation as the closest-looking ASCII approximation rather than as a true LaTeX or Typst expression, or skip a page entirely if image preprocessing fails. The copy-paste / spot-check verification habit in Step 5 is the response.
  • Multi-column scientific layouts, dense tables, equations, and handwriting are where modern VLMs (olmOCR-2, Chandra OCR) substantially beat Tesseract — the document types where Tesseract was historically weak. ‘Substantially better than Tesseract on hard content’ is not the same as ‘verified accurate.’ For evidence-grade material where every character matters (court exhibits, regulatory submissions), specialist commercial OCR (ABBYY FineReader, Adobe Acrobat Pro) remains the route. The lawyer-cluster sibling guide documents that routing explicitly.
  • Models do not bundle with the FileHop install. First use downloads 5.7–11GB from HuggingFace (MiniCPM-V 5.72GB; Chandra Q4 7.34GB; Chandra Q5 8.16GB; Chandra Q8 11.02GB; olmOCR-2 Q4 6.03GB; olmOCR-2 Q5 6.79GB; olmOCR-2 Q8 9.45GB). Downloads are resumable, cancelable, and pauseable. Once a model is on disk it never needs to be downloaded again. HuggingFace URLs are visible to the user before the download starts.
  • FileHop is Mac + Windows only. There is no Linux build today. On Linux, OCRmyPDF (Tesseract-backed) and the Chandra / olmOCR-2 reference implementations on the model authors' GitHub repos remain available; the FAQ documents the Linux alternatives.
  • Local OCR does not certify IRB compliance, HIPAA compliance, GDPR processor status, FERPA obligations, or any institutional data-classification policy. It reduces the specific risk of third-party-server exposure during the OCR step. The broader determination belongs to your institution and you.
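
The download sizes listed above add up quickly if you plan to keep several variants on disk. The sketch below sums them and compares the total against free space. The sizes are the ones quoted in this guide; the 1GB headroom and the use of decimal gigabytes (10^9 bytes) are assumptions of this example, and the function names are hypothetical.

```python
import shutil
from typing import Iterable

# Download sizes in GB, as listed in this guide.
DOWNLOAD_GB = {
    "MiniCPM-V Q4": 5.72,
    "Chandra Q4": 7.34,
    "Chandra Q5": 8.16,
    "Chandra Q8": 11.02,
    "olmOCR-2 Q4": 6.03,
    "olmOCR-2 Q5": 6.79,
    "olmOCR-2 Q8": 9.45,
}


def total_download_gb(models: Iterable[str]) -> float:
    """Sum the download sizes of the chosen variants."""
    return round(sum(DOWNLOAD_GB[m] for m in models), 2)


def has_room(models: Iterable[str], path: str = ".",
             headroom_gb: float = 1.0) -> bool:
    """True if the drive holding `path` can take the downloads plus headroom."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb(models) + headroom_gb


if __name__ == "__main__":
    wanted = ["MiniCPM-V Q4", "olmOCR-2 Q4"]
    print(wanted, "need", total_download_gb(wanted), "GB")
    print("fits on this drive:", has_room(wanted))
```

For scale, keeping all seven variants would come to about 54.5GB, so on a small laptop drive it is worth choosing one or two.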

Why this guide routes differently from the lawyer-cluster sibling

Both audiences share the same underlying privacy concern — uploading sensitive material to a third-party OCR service is the trade-off they are trying to avoid. The split is on the accuracy ceiling. The two articles are reciprocally cross-linked because the routing is genuinely different.

Researcher (this guide) — lead with local VLM

  • Material is often IRB-restricted, embargoed, or unpublished. Uploading is a documented compliance risk.
  • Tolerable accuracy is high but not evidence-grade — the researcher verifies with a copy-paste spot-check and the source scans remain co-located in the same folder.
  • Languages, scripts, handwriting, math, and multi-column layouts are common — exactly the content type where modern VLMs (Chandra OCR, olmOCR-2) substantially beat Tesseract.
  • Recommendation: run OCR locally with the model that fits your RAM and content type. This guide.

Lawyer (sibling guide) — route to specialist tools

  • Material is often evidence — court exhibits, deposition transcripts, signed contracts.
  • Required accuracy is evidence-grade. Local imperfection is a malpractice risk.
  • ABBYY FineReader and Adobe Acrobat Pro are paid specialist tools with decades of legal-document OCR optimization and audit trails. They process locally on the desktop.
  • Recommendation: route OCR to a specialist tool. See the sibling guide →
Convert old client scans (for lawyers) →

Shared concern, different ceiling: If you're a lawyer who landed here, follow the sibling routing. If you're a researcher who landed on the lawyer guide, follow this one.

Troubleshooting — common failure modes and what to try

Practical fixes for the recurring problems that show up across archive content.

Problem: Image-only PDF, nothing happens on copy/paste or search
What to try: Confirm the source PDF is image-only (pdftotext of the source returns nothing). Run the workflow on the file; check the sibling .md output sitting next to the PDF in the same folder.

Problem: Multi-column scientific paper, columns interleave in the extracted markdown
What to try: olmOCR-2 was trained specifically for multi-column scientific layouts. If your content is English and you have 10GB+ RAM, switch to olmOCR-2 Q4_K_M.

Problem: Handwritten lab notebook page, extracted text is garbled
What to try: olmOCR-2 does NOT support handwriting. Switch to Chandra OCR Q4_K_M (12GB+ RAM) or MiniCPM-V Q4 (8GB+). For very degraded handwriting, escalate to Chandra Q5 or Q8 (14–16GB RAM).

Problem: Non-Latin script (Devanagari, Arabic, Tamil, Bengali, Burmese, etc.) not recognized
What to try: Switch to Chandra OCR. It supports 40+ languages including the non-Latin scripts listed. MiniCPM-V supports a narrower set.

Problem: Math equations transcribed as ASCII approximation, not LaTeX or Typst
What to try: Chandra OCR can output direct Typst (supports_direct_typst is set in the model registry). Pick Typst as the output format with Chandra selected. Other models produce Markdown, which FileHop then converts to Typst via a two-pass step built into the OCR service.

Problem: TIFF input (single-page or multi-page), want searchable text
What to try: FileHop accepts TIFF directly. Use the same workflow as for PDFs; no separate conversion step is needed.

Problem: Whole archive is 500+ PDFs, batch is slow
What to try: VLM OCR runs page-by-page on CPU (or GPU if FileHop detects acceleration). 500 single-page papers will take hours on CPU-only. Let it run overnight; per-page progress events are visible the whole time. If absolute speed matters more than local privacy, the cloud OCR fallback (OpenAI or Gemini) is faster but requires upload.

Problem: Linux
What to try: FileHop does not ship a Linux build today. OCRmyPDF (Tesseract backend, 100+ languages) and the Chandra OCR + olmOCR-2 reference implementations on the model authors' GitHub repos remain available on Linux.

Problem: Want a single searchable-PDF artifact (not a sibling markdown file)
What to try: FileHop does not produce searchable-PDF output. Pass the source PDF through OCRmyPDF (open-source, local, CLI) for that artifact format.
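
The first check in the list above (pdftotext of the source returns nothing) can be run across a whole folder before you start a batch. This is a rough sketch, assuming the Poppler `pdftotext` command is installed; the 20-character threshold is an arbitrary assumption to tolerate stray form feeds and whitespace, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def looks_image_only(extracted_text: str, min_chars: int = 20) -> bool:
    """True if the text layer is empty or nearly empty once whitespace is removed."""
    visible = "".join(extracted_text.split())  # also drops page-break form feeds
    return len(visible) < min_chars


def pdf_text(pdf: Path) -> str:
    """Dump the existing text layer to stdout with Poppler's pdftotext."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def image_only_pdfs(folder: Path) -> list:
    """List the PDFs in a folder that have no usable text layer."""
    return [p for p in sorted(folder.glob("*.pdf")) if looks_image_only(pdf_text(p))]
```

Files this flags are the ones that need OCR; files it does not flag already have a text layer, and running them through a VLM would only add a second, possibly less accurate, copy of the text.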

Trust panel — how the OCR pipeline handles your files

Plain mechanics, no marketing language. Read this and the honest-scope block together.

  • Processing is local on Mac and Windows. The OCR service starts a local server on your machine, sends each PDF page (preprocessed as an image) to the model running on the same machine, and writes the output to a sibling file in the same folder.
  • Models live on disk after a one-time HuggingFace download (5.7–11GB depending on which model fits your RAM). The HuggingFace URLs are visible to you before the download starts; downloads are resumable, cancelable, and pauseable.
  • Cloud OCR exists as an opt-in fallback (OpenAI or Gemini, with your API key) for machines that do not meet the local-model RAM requirement. It is off by default and asks for explicit consent per task. It is NOT the recommended path for IRB-restricted, embargoed, or unpublished material.
  • FileHop is not certified for any IRB, HIPAA, GDPR, or FERPA protocol. Those certifications belong to your institution. Local OCR reduces the specific risk category of third-party-server exposure during the OCR step; the broader determination is yours.

Frequently asked questions

Which OCR model should I pick if I don't know what my content is?
Start with MiniCPM-V 2.6 Q4_K_M. It works on 8GB-RAM machines, supports 30+ languages, handles handwriting, tables, and math, and is the default recommendation of the FileHop RAM picker. If your archive turns out to be heavily non-Latin or has lots of handwriting, switch to Chandra OCR Q4_K_M (12GB+). If your archive is English-only academic papers with dense multi-column layouts, switch to olmOCR-2 Q4_K_M (10GB+).
How much disk space do the models take?
Total download per model: MiniCPM-V 2.6 Q4 is 5.7GB; Chandra OCR Q4 is 7.3GB, Q5 is 8.2GB, Q8 is 11GB; olmOCR-2 Q4 is 6GB, Q5 is 6.8GB, Q8 is 9.5GB. You download once per model; subsequent OCR runs load from the local cache.
Does FileHop upload anything during OCR?
No — the default local-VLM path is fully local. The OCR service starts a local server, sends each PDF page (preprocessed as an image) to the model running on your machine, and writes the output to disk. The HuggingFace download happens once when you first select a model; after that, no network call is required for OCR itself. There is a separate opt-in cloud OCR fallback (OpenAI or Gemini) for RAM-constrained machines; it requires an explicit selection and an API key the user supplies.
Can FileHop produce a single searchable-PDF artifact?
Not today. FileHop extracts recognized text to a sibling file (Markdown / TXT / JSON / HTML / Typst) next to the source scan. Both files live in the same folder and Spotlight or Windows Search indexes them together. If you need a single PDF with an invisible text layer over the original scan, run the source through OCRmyPDF (open-source, local, CLI) after this workflow. We considered claiming searchable-PDF output here; we will not until the feature ships.
How accurate is VLM-based OCR vs. Tesseract on a scientific paper?
Modern VLM OCR substantially beats Tesseract on multi-column scientific layouts, dense tables, math, and handwriting, the document types where Tesseract was historically weak. Allen AI's olmOCR-2 scores 82.4 on olmOCR-Bench (October 2025), ahead of Marker (76.1) and MinerU (75.8); Chandra OCR's vendor-published benchmark reports it ahead of GPT-5 Mini and Gemini across 90 languages. ‘Substantially better than Tesseract on hard content’ is not the same as ‘verified accurate’; the spot-check verification habit in the workflow is the response.
What about Chandra OCR's 90-language version?
Chandra 2 (datalab-to, March 2026 release) extends Chandra OCR's base 40+ languages to 90 languages with stronger benchmarks. FileHop ships Chandra OCR Q4 / Q5 / Q8 (the GGUF-quantized variants from noctrex on HuggingFace). The 40+ language list in the registry covers the most common non-Latin scripts (Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, Nepali, Sinhala, Burmese, Khmer, Lao, Georgian, Amharic, Tigrinya, Swahili, Hausa, Yoruba, Igbo, Zulu, Xhosa, etc.).
Does the OCR step satisfy my IRB protocol?
No software can certify your IRB protocol — that determination belongs to your IRB and your institution. Local OCR reduces a specific risk category: it does not send your material to a third-party server during the OCR step. The broader determination of whether your data handling satisfies your protocol is yours. Document this guide as the procedure you used and submit it to your IRB if that is appropriate for your protocol.
What if I'm on Linux?
FileHop does not ship a Linux build today (Mac + Windows only). On Linux, OCRmyPDF (Tesseract backend, 100+ languages, open-source) covers printed-text scanned PDFs well, and the Chandra OCR + olmOCR-2 reference implementations on the model authors' GitHub repos are available for VLM-based local OCR if you are comfortable with Python or Docker.
Can I OCR a handwritten lab notebook?
Yes, with Chandra OCR (Q4_K_M for 12GB RAM; Q5_K_M for 14GB; Q8_0 for 16GB+) or MiniCPM-V 2.6 Q4 (8GB+). olmOCR-2 does not support handwriting. Expect to verify more carefully than on printed text — handwritten scientific notation, abbreviations, and figure annotations are the hardest content for any OCR system, and Chandra's vendor benchmark explicitly tests handwriting + tables + math + layout but does not claim parity with a trained human transcriber. Treat the extracted text as the first pass of a transcription, not the final transcription.
What if the cloud OCR fallback gives better accuracy on a specific page?
If you have explicitly decided that, for a particular non-sensitive page (e.g., a published paper that is already on the public web, or a scan of your own already-published manuscript), you would rather trade local privacy for cloud accuracy, FileHop exposes an opt-in cloud OCR path (OpenAI or Gemini, your API key). This is NOT the recommended path for IRB-restricted, embargoed, or unpublished material. This guide mentions the cloud fallback for completeness; it does not recommend it for the audience the guide is written for.
How do I cross-reference an extracted markdown file back to a specific page of the source PDF?
The OCR service returns one result per source page; when FileHop runs the workflow on a multi-page PDF, per-page progress events fire with current_page / total_pages. The sibling markdown file is named after the source PDF (paper_001.pdf → paper_001.md). For per-page cross-reference, set the output format to JSON — each page is a separate JSON object with the page_number and total_pages fields populated from the OCR service.
I'm a lawyer reading this — should I use local VLM OCR for evidence?
No — see the sibling guide ‘Convert old client scans (for lawyers)’. The accuracy ceiling that researchers can tolerate (with a spot-check verification habit and the source scans co-located) is below evidence-grade. For court exhibits, signed contracts, deposition transcripts, and any other material where a misread character is a malpractice risk, specialist commercial OCR (ABBYY FineReader, Adobe Acrobat Pro) running locally on the desktop remains the route. Both audiences share the underlying privacy concern; the split is on the accuracy ceiling.
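
The per-page JSON output described in the cross-reference answer above lends itself to a simple lookup: given a search term, which pages of the source PDF should you open? The sketch below is illustrative only. The `page_number` and `total_pages` fields are named in this guide; the assumption that the file holds a list of page objects, and the `text` key for the recognized text, are guesses made for this example, so check them against a real output file first.

```python
import json
from typing import Any, Dict, List


def pages_matching(pages: List[Dict[str, Any]], needle: str,
                   text_key: str = "text") -> List[int]:
    """Return the page numbers whose recognized text contains `needle`."""
    needle = needle.lower()
    return [
        page["page_number"]
        for page in pages
        if needle in str(page.get(text_key, "")).lower()
    ]


# Hypothetical two-page result, shaped like the assumption described above.
sample = json.loads("""
[
  {"page_number": 1, "total_pages": 2, "text": "Methods and participants"},
  {"page_number": 2, "total_pages": 2, "text": "Hippocampus volume, p < 0.05"}
]
""")

print(pages_matching(sample, "hippocampus"))  # [2]
```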

Download FileHop and OCR your archive locally

Models are downloaded on first use from HuggingFace (5.7–11GB depending on which model fits your RAM). They are not bundled with the install. The OCR runs on your machine; your archive does not leave it. Mac and Windows.