
590 Million Corrections
in 7 Hours

OCR Cleanup at Scale with Rust

robotdad & Amplifier
The Challenge

The Corpus from Story 1

1.4M
Documents Downloaded
99.78%
Need OCR Cleanup
pre-1800 to 1914
Historical Scans

Old printing + Decades-old OCR = Artifacts Everywhere

The Problem

The OCR Artifacts

"Congreſs ſhall make no law"
"buſineſs diſcuſsion"
"pubhsh the report"
"TI1E committee met"
"Chapter Five — Downloaded from Google Books"
The Baseline
33.75
Days
Python baseline: 0.4 files/second
1.17M files × 2.5 seconds each

"We can't wait a month for this."
The Patterns

Building the Pattern Library

150+
Long-s Patterns
ſ → s, eſs → ess, aſh → ash
200+
OCR Corrections
hstory → history, TIIE → THE
50+
Watermark Removal
Google Books, Archive.org stamps
50+
Other Fixes
Whitespace, hyphenation, etc.

~450 regex patterns compiled and ready
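A minimal sketch of how a pattern library like this might be applied. The patterns shown are a tiny illustrative subset drawn from the examples above, not the project's actual tables:

```python
import re

# Illustrative subset of the ~450 correction patterns.
# Each entry: (compiled regex, replacement). Compiled once, reused per file.
PATTERNS = [
    (re.compile(r"ſ"), "s"),                            # long-s normalization
    (re.compile(r"\bTI[I1]E\b"), "THE"),                # common OCR misread of THE
    (re.compile(r"\bhstory\b"), "history"),             # dropped-letter fix
    (re.compile(r"Downloaded from Google Books"), ""),  # watermark removal
    (re.compile(r"[ \t]+\n"), "\n"),                    # trailing whitespace
]

def clean(text: str) -> tuple[str, int]:
    """Apply every pattern in order; return cleaned text and substitution count."""
    total = 0
    for pattern, repl in PATTERNS:
        text, n = pattern.subn(repl, text)
        total += n
    return text, total
```

For example, `clean("Congreſs ſhall make no law")` normalizes both long-s characters in one pass.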

The Solution

Rust + PyO3

Best tool for each part of the job
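The hybrid shape — Python orchestrating a compiled core — can be sketched in pure Python, with a thread pool standing in for the Rust/Rayon layer. All names here are hypothetical; in the real project the per-file call crosses into Rust via PyO3, which parallelizes for real by releasing the GIL:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# One pattern compiled at startup and shared by every worker.
# (Illustrative: the real core holds the full pattern library.)
LONG_S = re.compile("ſ")

def clean_file(text: str) -> str:
    # Stand-in for the PyO3 call into the Rust core.
    return LONG_S.sub("s", text)

def clean_corpus(texts: list[str], workers: int = 16) -> list[str]:
    # Python side: fan the corpus out across workers, keep results in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_file, texts))
```

The division of labor mirrors the slide: Python handles file discovery and orchestration; the compiled core does the hot regex loop.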

Validation

The Testing Ladder

Test 1: 100 Files
The Smoke Test
10 seconds • 20,099 substitutions • Does it run without crashing?
Test 2: 1,000 Files
Pattern Validation
28 seconds • 157,146 substitutions • Vocabulary extraction finds pattern sync bug
Test 3: 100,000 Files
Precision Test
21 minutes • 43.3M substitutions • The "enough is enough" moment
Quality Gate

Bug Caught During Testing

"Wait, those are perfectly valid words. What's happening?"
— robotdad, reviewing vocabulary analysis

Pattern sync issue between Python and Rust
1.2M false positives caught before production

Precision
99.98%
Precision
100,000 files tested • 1,014,036 suspicious words flagged
Only 225 were false positives

"Only 225 cleared by dictionary lookup. That seems really good to me."
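The flag-then-verify step might look something like this sketch: collect words the cleanup introduced, then clear any flagged word that is already a valid dictionary word. The tiny word set is illustrative; the real run used vocabulary extraction over 100,000 files:

```python
# Hypothetical sketch of the validation step, not the project's actual code.
# A real run would load a full wordlist; this set is illustrative.
DICTIONARY = {"business", "discussion", "congress", "history", "the"}

def flag_suspicious(before: str, after: str) -> set[str]:
    """Words present after cleanup that were not in the original text."""
    return set(after.lower().split()) - set(before.lower().split())

def false_positives(flagged: set[str]) -> set[str]:
    """Flagged words that are valid dictionary words, i.e. likely safe edits."""
    return {w for w in flagged if w in DICTIONARY}
```

Flagged words that survive the dictionary lookup are the ones worth a human look.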
Decision Point

The "Enough is Enough" Moment

The remaining 225 errors weren't fixable with regex patterns

Data-driven decision: Stop adding patterns, pivot to statistical noise filtering

Production

Production Run

1.17M
Files Processed
590M
Corrections Made
7h 14m
Total Runtime
111.9×
faster than Python baseline
Performance

Where the Speedup Came From

Python Baseline
• Interpreted bytecode
• Regex compiled per file
• Single-threaded (GIL)
• String copying overhead
→ 0.4 files/sec

Rust + PyO3
• Compiled native code (~10×)
• Pre-compiled regex (~5×)
• 16-core parallelism (~2.3×)
• Zero-copy strings (~1.5×)
→ 44.75 files/sec

The multiplier breakdown is an approximate decomposition of the overall 111.9× speedup; the individual factors overlap rather than multiplying exactly
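The headline numbers check out on the back of an envelope (the rounded 1.17M file count is used here; the slide figures presumably come from the exact counts):

```python
# Back-of-envelope check of the production-run figures.
files = 1_170_000                  # rounded file count
runtime_s = 7 * 3600 + 14 * 60     # 7h 14m
rust_rate = files / runtime_s      # ~45 files/sec
python_rate = 0.4                  # measured baseline, files/sec
speedup = rust_rate / python_rate  # ~112x, matching the 111.9x headline
```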

Collaboration

Human-AI Collaboration

robotdad
• Identified performance bottleneck
• Made architectural decision (Rust)
• Defined pattern categories
• Directed testing & validation
• Decided when to stop iterating
Amplifier
• Implemented Rust core + PyO3
• Built 450+ pattern library
• Benchmarked approaches
• Integrated Rayon parallelization
• Generated validation reports

Domain knowledge + rapid implementation = production-scale results

Impact

What This Demonstrates

From bottleneck to breakthrough in days, not weeks

Outcome

Ready for Training

1.17M
Clean Documents
pre-1800 to 1914
Pre-WWI Era

Next story: Training a timecapsule LLM that doesn't know about WWI

Open Source

Try It Yourself

The timecapsule-data project is open source

View on GitHub
450+ OCR pattern library
Rust core with PyO3
Hybrid Python/Rust architecture
Tested on 1.17M files
Sources

Research Methodology

Data as of: February 2026

Feature status: Active — project is open source at github.com/robotdad/timecapsule-data


Primary contributors: robotdad (Brian Krabach) — project creator and domain expert

More Amplifier Stories