HUMAN-AI COLLABORATION • REAL-WORLD ENGINEERING

Taming the Internet Archive

How robotdad & Amplifier built production-scale download infrastructure
for 2.3 million historical documents

A story about API wrestling, rate limit nightmares,
and what's possible with conversational development

THE VISION

Building a Timecapsule LLM

What if you trained an AI that doesn't know about computers, smartphones, or the internet?

The Challenge

Train a language model exclusively on pre-World War I data (before 1914). An AI with the language, knowledge, and perspective of the 19th and early 20th centuries.

What This Requires

A massive corpus of historical documents. Newspapers, books, periodicals—millions of them. All from before 1914. All digitized, downloadable, processable.

STANDING ON SHOULDERS

Inspired by Hayk Grigorian's TimeCapsuleLLM

A proof-of-concept showing that an LLM can be trained on historical data alone

The Original TimeCapsuleLLM

Hayk Grigorian pioneered this concept with his TimeCapsuleLLM project:

Hayk's Proof-of-Concept

136K

Documents, London-focused
Single city, 75-year span

Our Production Scale

2.3M

~17× larger dataset
Global pre-WWI scope
Multiple countries, 114-year span

THE CHALLENGE

The Scale

2.3M
Items to download
from Internet Archive
83K+
Historical newspapers
to catalog
~200K
Books across
various collections

And the Internet Archive's rate limits are completely undocumented.

🚨
HTTP 429

You are being rate limited.

The official Internet Archive library provides zero assistance with rate limiting.

# What happened:
12 workers → banned immediately
2 workers → still banned
Added delays → STILL banned
Made delays longer → STILL BANNED

# Found in the docs:
"The internetarchive library does not provide assistance
with complying with rate limiting."

# 😤

THE NUCLEAR OPTION

Delete the Official Library

robotdad's decision: 784 lines gone. Amplifier rebuilt it from scratch.

Custom Rate Limiter

  • Base delay: 3.0 seconds
  • Max delay: 120 seconds
  • On error: delay × 2.5
  • On HTTP 429: delay × 4.0
  • Ban detection: Exit after 5 consecutive 429s
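The parameters above can be sketched as a small Python class. The multipliers, bounds, and ban threshold come from the list; the class shape and the reset-to-base-on-success behavior are assumptions, not the project's actual code.

```python
import time

class AdaptiveRateLimiter:
    """Adaptive backoff sketch using the parameters listed above."""
    BASE_DELAY = 3.0          # seconds between requests
    MAX_DELAY = 120.0         # hard ceiling on the delay
    ERROR_FACTOR = 2.5        # backoff multiplier on generic errors
    RATE_LIMIT_FACTOR = 4.0   # backoff multiplier on HTTP 429
    BAN_THRESHOLD = 5         # consecutive 429s before giving up

    def __init__(self):
        self.delay = self.BASE_DELAY
        self.consecutive_429s = 0

    def wait(self):
        """Call before each request."""
        time.sleep(self.delay)

    def on_success(self):
        """Assumed behavior: a clean response resets the backoff."""
        self.delay = self.BASE_DELAY
        self.consecutive_429s = 0

    def on_error(self, status=None):
        """Grow the delay; bail out if a ban looks likely."""
        if status == 429:
            self.consecutive_429s += 1
            if self.consecutive_429s >= self.BAN_THRESHOLD:
                raise SystemExit("Likely banned: 5 consecutive 429s")
            self.delay = min(self.delay * self.RATE_LIMIT_FACTOR, self.MAX_DELAY)
        else:
            self.delay = min(self.delay * self.ERROR_FACTOR, self.MAX_DELAY)
```

A generic error takes the delay from 3.0 s to 7.5 s; two 429s after that push it to the 120 s ceiling.
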

The Result

Stable downloads. No more bans. The pipeline ran for hours without incident.

Sometimes you need to go lower-level than the official library.

CRISIS #2

The Pagination Wall

The API says there are 2.3M results. But we can only get 250 items.

What the API Reports

2.3M

numFound: 2,346,892
Total results available

What We Can Actually Get

250

Page 1-5: 50 items each
Page 6: 0 items ← Hard wall
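The wall is reproducible against the Archive's public advancedsearch endpoint. This sketch only builds the paginated URLs; the query string is illustrative, not the project's actual query.

```python
import urllib.parse

SEARCH_BASE = "https://archive.org/advancedsearch.php"

def search_page_url(query: str, page: int, rows: int = 50) -> str:
    """Build a paginated search URL (JSON output, identifiers only)."""
    params = urllib.parse.urlencode({
        "q": query,
        "fl[]": "identifier",
        "rows": rows,
        "page": page,
        "output": "json",
    })
    return f"{SEARCH_BASE}?{params}"

# Pages 1-5 return 50 docs each; page 6 comes back empty,
# even though numFound reports ~2.3M results.
```
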

THE DEBUGGING MARATHON

Hours of Trying Everything

"I'm very angry right now."
"Are we overcomplicating the search? What is going on here?"
"This worked at some point. Cool cool cool."

— robotdad, during the debugging session

THE BREAKTHROUGH

Discovery: A Different API

Internet Archive has TWO completely separate systems.

Search API

Nice web interface
Great for browsing
Hard 10K pagination limit
We hit wall at 250

❌ Not for bulk access

Bulk Export API

Direct item access
No pagination theatrics
Reliable metadata fetch
Works every time

✅ Built for scale

# The bulk export endpoint
https://archive.org/metadata/{identifier}

# Just give it an ID, get full metadata back. Every time.

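Using the endpoint is a single GET per item, a minimal sketch with the standard library (the real pipeline pairs each call with the custom rate limiter):

```python
import json
import urllib.request

def metadata_url(identifier: str) -> str:
    """The bulk endpoint takes a bare item identifier."""
    return f"https://archive.org/metadata/{identifier}"

def fetch_metadata(identifier: str) -> dict:
    """One GET, full metadata JSON back. No pagination involved."""
    with urllib.request.urlopen(metadata_url(identifier)) as resp:
        return json.load(resp)
```
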
THE SOLUTION

Pattern Enumeration

Don't rely on search pagination. Enumerate systematically.

1. Discovery

Use search API to find initial identifiers and collection patterns

2. Enumeration

Generate identifiers systematically: date-based scanning, sequential patterns

3. Bulk Fetch

Use bulk export API to get metadata for each identifier (with custom rate limiter)

# Example: Newspaper identifier patterns
sim_american-journal-of-science_1800-01-01_1_1
sim_american-journal-of-science_1800-02-01_1_2
sim_american-journal-of-science_1800-03-01_1_3

# Collection prefix + date + volume + issue
# Beautiful. Predictable. Enumeratable.
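A date-based enumerator for patterns like these can be sketched as a small generator. It yields only the prefix-plus-date stem; the trailing volume/issue suffixes vary per collection, so this is an illustration rather than the project's enumerator.

```python
def enumerate_monthly_stems(prefix: str, start_year: int, end_year: int):
    """Yield date-based identifier stems for a monthly periodical.

    Volume/issue suffixes (the trailing _1_1 parts) differ by
    collection, so a real enumerator appends them separately.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield f"{prefix}_{year:04d}-{month:02d}-01"
```

Each stem can then be handed straight to the bulk metadata endpoint to check whether the item exists.
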

Production Pipeline: Catalog → Filter → Download

Selectivity and discipline, not just downloading everything

1. Discovery & Catalog

2.3M

Items cataloged from Internet Archive

2. Quality Filtering

🔍

Applied quality criteria: completeness, OCR quality, metadata

3. Final Download

~1.4M

High-quality documents selected for processing

Multi-day stable downloads
Zero HTTP 429 errors
No human intervention needed
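The quality-filtering step in the pipeline above can be sketched as a simple gate over each item's metadata. The field names and the confidence threshold here are assumptions for illustration; the deck names only the criteria (completeness, OCR quality, metadata).

```python
def passes_quality_filter(item: dict,
                          min_ocr_confidence: float = 0.85) -> bool:
    """Illustrative quality gate: field names and the 0.85
    threshold are assumptions, not the project's criteria."""
    # Completeness: core metadata must be present and non-empty.
    for field in ("identifier", "date", "title"):
        if not item.get(field):
            return False
    # OCR quality: reject items whose confidence is known and low.
    ocr = item.get("ocr_confidence")
    if ocr is not None and ocr < min_ocr_confidence:
        return False
    return True
```

A gate like this is how 2.3M cataloged items narrow to the ~1.4M actually downloaded.
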
HOW WE WORKED TOGETHER

Human-AI Partnership on Real Engineering

robotdad's Role

  • Strategic decisions: "Delete the official library"
  • Direction: "Try pattern enumeration instead"
  • Debugging intuition: Knowing what to question next
  • Frustration that drove breakthroughs: Questioning assumptions when stuck

Amplifier's Role

  • Implementation: Writing the custom rate limiter code
  • Discovery: Finding bulk export API documentation
  • Testing: Rapidly trying different approaches
  • Execution: Building the production pipeline

Neither alone would have solved this as efficiently.
The collaboration was the breakthrough.

WHAT AMPLIFIER CAN DO

Real Production Engineering Through Conversation

🔧 Build Custom Infrastructure

When libraries don't cut it, build from scratch. Custom rate limiters, adaptive backoff, production error handling.

🔍 Research & Discover

Find obscure API documentation, dig through GitHub issues, surface technical solutions you didn't know existed.

⚡ Rapid Iteration

Try dozens of approaches in hours. Test theories immediately. Get from "it doesn't work" to "production-stable" fast.

🏗️ Architectural Pivots

When the approach isn't working, redesign. Pattern enumeration, bulk APIs, hybrid strategies—whatever it takes.

📊 Production Scale

Not toy projects. 2.3M items. Multi-day pipelines. Real error handling. Actual production systems.

🤝 True Collaboration

You make the calls, Amplifier executes. Debugging marathons together. Your intuition + AI implementation.

WHAT'S NEXT

The Story Continues

📄 Part 2: OCR at Scale

Next story coming soon:

  • Rust layer for 46× speedup
  • 143 million text corrections
  • Processing 2.3M historical documents
  • Another human-AI engineering collaboration

🧠 Part 3: Training the Timecapsule LLM

The ultimate goal:

  • Training on pre-WWI data only
  • An AI that doesn't know about modern tech
  • 19th century language and perspective
  • Full training story when complete

This download infrastructure was just the beginning.

SOURCES

Research Methodology

Data as of: February 2026

Feature status: Active — project is open source on GitHub

Research performed:

Gaps: Exact commit counts and PR history not researched for this deck. Document counts (~2.3M cataloged, ~1.4M selected) are from the project's own reporting.

Primary contributors: robotdad (Brian Krabach)

EXPLORE & TRY

See the Code

The entire project is open source

timecapsule-data repository

github.com/robotdad/timecapsule-data

✅ Custom rate limiter with adaptive backoff
✅ Bulk export API integration
✅ Pattern enumeration strategies
✅ Production-tested error handling

Curious what you can build with Amplifier?
This was built through conversational development sessions.

More Amplifier Stories