OCR Cleanup at Scale with Rust
Old printing + Decades-old OCR = Artifacts Everywhere
~450 regex patterns compiled and ready
Best tool for each part of the job
Pattern sync issue between Python and Rust
1.2M false positives caught before production
The remaining 225 errors weren't fixable with regex patterns
Data-driven decision: Stop adding patterns, pivot to statistical noise filtering
Multiplier breakdowns are an approximate decomposition of the overall 111.9× speedup
Domain knowledge + rapid implementation = production-scale results
From bottleneck to breakthrough in days, not weeks
Next story: Training a timecapsule LLM that doesn't know about WWI
The timecapsule-data project is open source
View on GitHub
Data as of: February 2026
Feature status: Active — project is open source at github.com/robotdad/timecapsule-data
Research performed:
github.com/robotdad/timecapsule-data
Metrics source:
Gaps & estimates:
Primary contributors: robotdad (Brian Krabach) — project creator and domain expert