How robotdad & Amplifier built production-scale download infrastructure
for 2.3 million historical documents
A story about API wrestling, rate limit nightmares,
and what's possible with conversational development
What if you trained an AI that doesn't know about computers, smartphones, or the internet?
Train a language model exclusively on pre-World War I data (before 1914). An AI with the language, knowledge, and perspective of the 19th and early 20th centuries.
A massive corpus of historical documents. Newspapers, books, periodicals—millions of them. All from before 1914. All digitized, downloadable, processable.
Hayk Grigorian pioneered this concept with his TimeCapsuleLLM project:
A proof of concept showing that LLMs can be trained on historical data only
TimeCapsuleLLM: London-focused documents. Single city, 75-year span.
This project: ~15× larger dataset. Global pre-WWI scope. Multiple countries, 114-year span.
And the Internet Archive has completely undocumented rate limits.
The official Internet Archive client library provides no help with rate limiting.
robotdad's decision: 784 lines gone. Amplifier rebuilt it from scratch.
Stable downloads. No more bans. The pipeline ran for hours without incident.
Sometimes you need to go lower-level than the official library.
The API says there are 2.3M results, but we can only retrieve 250 items.
numFound: 2,346,892
Total results available
Pages 1-5: 50 items each
Page 6: 0 items ← Hard wall
AND NOT operators → same wall at 250
— robotdad, during the debugging session
Internet Archive has TWO completely separate systems.
Search API:
Nice web interface
Great for browsing
Hard 10K pagination limit
We hit the wall at 250
❌ Not for bulk access
Bulk export API:
Direct item access
No pagination theatrics
Reliable metadata fetch
Works every time
✅ Built for scale
Don't rely on search pagination. Enumerate systematically.
Use search API to find initial identifiers and collection patterns
Generate identifiers systematically: date-based scanning, sequential patterns
Use bulk export API to get metadata for each identifier (with custom rate limiter)
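The three-step strategy above can be sketched in Python. The `sim_example-daily_YYYY-MM-DD` identifier pattern is hypothetical (real collections each have their own naming scheme, discovered via the search API); the `archive.org/metadata/{identifier}` endpoint is the Internet Archive's documented per-item metadata API, which needs no pagination.

```python
from datetime import date, timedelta

METADATA_API = "https://archive.org/metadata/{identifier}"

def date_based_identifiers(prefix: str, start: date, end: date):
    """Generate candidate identifiers by scanning a date range.

    The '{prefix}_{YYYY-MM-DD}' pattern is an illustrative assumption;
    real patterns come from studying search-API results per collection.
    """
    current = start
    while current <= end:
        yield f"{prefix}_{current.isoformat()}"
        current += timedelta(days=1)

def metadata_url(identifier: str) -> str:
    """Build the per-item metadata URL (direct access, no pagination)."""
    return METADATA_API.format(identifier=identifier)

# Enumerate a small range, then build one metadata request per item.
ids = list(date_based_identifiers("sim_example-daily",
                                  date(1913, 12, 30), date(1914, 1, 1)))
urls = [metadata_url(i) for i in ids]
```

Each URL is then fetched individually, throttled by the custom rate limiter, so the 250-item search wall never comes into play.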
Selectivity and discipline, not just downloading everything
~2.3M items cataloged from Internet Archive
Applied quality criteria: completeness, OCR quality, metadata
~1.4M high-quality documents selected for processing
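A selection pass like the one described might look like this sketch. The field names (`imagecount`, `ocr_confidence`) and thresholds are illustrative assumptions, not the project's actual criteria.

```python
def passes_quality_bar(item: dict) -> bool:
    """Hypothetical quality gate: metadata completeness, scan
    completeness, OCR quality. Thresholds are assumptions."""
    required_metadata = ("identifier", "date", "language")
    if any(not item.get(k) for k in required_metadata):
        return False                          # missing core metadata
    if item.get("imagecount", 0) < 2:
        return False                          # likely an incomplete scan
    if item.get("ocr_confidence", 0.0) < 0.80:
        return False                          # OCR too noisy to train on
    return True

catalog = [
    {"identifier": "a", "date": "1890-01-01", "language": "eng",
     "imagecount": 120, "ocr_confidence": 0.93},
    {"identifier": "b", "date": "1902-06-15", "language": "eng",
     "imagecount": 1, "ocr_confidence": 0.95},   # incomplete scan
    {"identifier": "c", "date": "", "language": "eng",
     "imagecount": 40, "ocr_confidence": 0.91},  # missing date
]
selected = [item for item in catalog if passes_quality_bar(item)]
```

Applied across the full catalog, a filter of this shape is how millions of cataloged items shrink to a smaller, higher-quality training set.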
Neither alone would have solved this as efficiently.
The collaboration was the breakthrough.
When libraries don't cut it, build from scratch. Custom rate limiters, adaptive backoff, production error handling.
Find obscure API documentation, dig through GitHub issues, surface technical solutions you didn't know existed.
Try dozens of approaches in hours. Test theories immediately. Get from "it doesn't work" to "production-stable" fast.
When the approach isn't working, redesign. Pattern enumeration, bulk APIs, hybrid strategies—whatever it takes.
Not toy projects. 2.3M items. Multi-day pipelines. Real error handling. Actual production systems.
You make the calls, Amplifier executes. Debugging marathons together. Your intuition + AI implementation.
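The "custom rate limiter with adaptive backoff" mentioned above could work along these lines. This is a minimal sketch under assumed semantics (fixed spacing between requests, doubling on failure, gradual recovery on success), not the project's actual implementation.

```python
import time

class AdaptiveRateLimiter:
    """Space requests by a delay that doubles after failures
    (e.g. HTTP 429/503) and eases back after successes.
    Illustrative sketch only; clock and sleep are injectable."""

    def __init__(self, base_delay=1.0, max_delay=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until it is safe to issue the next request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

    def on_success(self):
        # Ease back toward the base rate.
        self.delay = max(self.base_delay, self.delay * 0.9)

    def on_failure(self):
        # Server pushed back: double the spacing, capped.
        self.delay = min(self.max_delay, self.delay * 2)
```

In a download loop: call `wait()` before each request, then `on_success()` or `on_failure()` depending on the response. Because recovery is gradual while backoff is multiplicative, the loop settles near the fastest rate the server tolerates without triggering bans.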
This download infrastructure was just the beginning.
The ultimate goal, the pre-1914 model itself, is the next story. Coming soon.
Data as of: February 2026
Feature status: Active — project is open source on GitHub
Research performed:
Gaps: Exact commit counts and PR history not researched for this deck. Document counts (~2.3M cataloged, ~1.4M selected) are from the project's own reporting.
Primary contributor: robotdad (Brian Krabach)
The entire project is open source
github.com/robotdad/timecapsule-data
✅ Custom rate limiter with adaptive backoff
✅ Bulk export API integration
✅ Pattern enumeration strategies
✅ Production-tested error handling
Curious what you can build with Amplifier?
This was built through conversational development sessions.