HUMAN-AI COLLABORATION • REAL-WORLD ENGINEERING

Taming the Internet Archive

How robotdad & Amplifier built production-scale download infrastructure
for 2.3 million historical documents

A story about API wrestling, rate limit nightmares,
and what's possible with conversational development

THE VISION

Building a Timecapsule LLM

What if you trained an AI that doesn't know about computers, smartphones, or the internet?

The Challenge

Train a language model exclusively on pre-World War I data (before 1914). An AI with the language, knowledge, and perspective of the 19th and early 20th centuries.

What This Requires

A massive corpus of historical documents. Newspapers, books, periodicals—millions of them. All from before 1914. All digitized, downloadable, processable.

STANDING ON SHOULDERS

Inspired by Hayk Grigorian's TimeCapsuleLLM

A proof-of-concept showing that an LLM can be trained on historical data alone

The Original TimeCapsuleLLM

Hayk Grigorian pioneered this concept with his TimeCapsuleLLM project:

Hayk's Proof-of-Concept

136K

Documents, London-focused
Single city, 75-year span

Our Production Scale

2.3M

~17× larger dataset
Global pre-WWI scope
Multiple countries, 114-year span

THE CHALLENGE

The Scale

2.3M
Items to download
from Internet Archive
83K+
Historical newspapers
to catalog
~200K
Books across
various collections

And the Internet Archive's rate limits are completely undocumented.

🚨
HTTP 429

You are being rate limited.

The official Internet Archive library provides zero assistance with rate limiting.

# What happened:
12 workers → banned immediately
2 workers → still banned
Added delays → STILL banned
Made delays longer → STILL BANNED

# Found in the docs:
"The internetarchive library does not provide assistance
with complying with rate limiting."

# 😤

THE NUCLEAR OPTION

Delete the Official Library

robotdad's decision: 784 lines gone. Amplifier rebuilt it from scratch.

Custom Rate Limiter

  • Base delay: 3.0 seconds
  • Max delay: 120 seconds
  • On error: delay × 2.5
  • On HTTP 429: delay × 4.0
  • Ban detection: Exit after 5 consecutive 429s
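The parameters above can be sketched as a small Python class. The multipliers, bounds, and ban threshold come from the list; the class shape and the reset-to-base-on-success behavior are assumptions, not the project's actual code.

```python
import time

class AdaptiveRateLimiter:
    """Adaptive backoff sketch using the parameters listed above."""
    BASE_DELAY = 3.0          # seconds between requests
    MAX_DELAY = 120.0         # hard ceiling on the delay
    ERROR_FACTOR = 2.5        # backoff multiplier on generic errors
    RATE_LIMIT_FACTOR = 4.0   # backoff multiplier on HTTP 429
    BAN_THRESHOLD = 5         # consecutive 429s before giving up

    def __init__(self):
        self.delay = self.BASE_DELAY
        self.consecutive_429s = 0

    def wait(self):
        """Call before each request."""
        time.sleep(self.delay)

    def on_success(self):
        """Assumed behavior: a clean response resets the backoff."""
        self.delay = self.BASE_DELAY
        self.consecutive_429s = 0

    def on_error(self, status=None):
        """Grow the delay; bail out if a ban looks likely."""
        if status == 429:
            self.consecutive_429s += 1
            if self.consecutive_429s >= self.BAN_THRESHOLD:
                raise SystemExit("Likely banned: 5 consecutive 429s")
            self.delay = min(self.delay * self.RATE_LIMIT_FACTOR, self.MAX_DELAY)
        else:
            self.delay = min(self.delay * self.ERROR_FACTOR, self.MAX_DELAY)
```

A generic error takes the delay from 3.0 s to 7.5 s; two 429s after that push it to the 120 s ceiling.
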

The Result

Stable downloads. No more bans. The pipeline ran for hours without incident.

Sometimes you need to go lower-level than the official library.

CRISIS #2

The Pagination Wall

The API says there are 2.3M results. But we can only get 250 items.

What the API Reports

2.3M

numFound: 2,346,892
Total results available

What We Can Actually Get

250

Page 1-5: 50 items each
Page 6: 0 items ← Hard wall
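The wall is reproducible against the Archive's public advancedsearch endpoint. This sketch only builds the paginated URLs; the query string is illustrative, not the project's actual query.

```python
import urllib.parse

SEARCH_BASE = "https://archive.org/advancedsearch.php"

def search_page_url(query: str, page: int, rows: int = 50) -> str:
    """Build a paginated search URL (JSON output, identifiers only)."""
    params = urllib.parse.urlencode({
        "q": query,
        "fl[]": "identifier",
        "rows": rows,
        "page": page,
        "output": "json",
    })
    return f"{SEARCH_BASE}?{params}"

# Pages 1-5 return 50 docs each; page 6 comes back empty,
# even though numFound reports ~2.3M results.
```
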

THE DEBUGGING MARATHON

Hours of Trying Everything

"I'm very angry right now."
"Are we overcomplicating the search? What is going on here?"
"This worked at some point. Cool cool cool."

— robotdad, during the debugging session

THE BREAKTHROUGH

Discovery: A Different API

Internet Archive has TWO completely separate systems.

Search API

Nice web interface
Great for browsing
Hard 10K pagination limit
We hit wall at 250

❌ Not for bulk access

Bulk Export API

Direct item access
No pagination theatrics
Reliable metadata fetch
Works every time

✅ Built for scale

# The bulk export endpoint
https://archive.org/metadata/{identifier}

# Just give it an ID, get full metadata back. Every time.

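Using the endpoint is a single GET per item, a minimal sketch with the standard library (the real pipeline pairs each call with the custom rate limiter):

```python
import json
import urllib.request

def metadata_url(identifier: str) -> str:
    """The bulk endpoint takes a bare item identifier."""
    return f"https://archive.org/metadata/{identifier}"

def fetch_metadata(identifier: str) -> dict:
    """One GET, full metadata JSON back. No pagination involved."""
    with urllib.request.urlopen(metadata_url(identifier)) as resp:
        return json.load(resp)
```
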
THE SOLUTION

Pattern Enumeration

Don't rely on search pagination. Enumerate systematically.

1. Discovery

Use search API to find initial identifiers and collection patterns

2. Enumeration

Generate identifiers systematically: date-based scanning, sequential patterns

3. Bulk Fetch

Use bulk export API to get metadata for each identifier (with custom rate limiter)

# Example: Newspaper identifier patterns
sim_american-journal-of-science_1800-01-01_1_1
sim_american-journal-of-science_1800-02-01_1_2
sim_american-journal-of-science_1800-03-01_1_3

# Collection prefix + date + volume + issue
# Beautiful. Predictable. Enumeratable.
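A date-based enumerator for patterns like these can be sketched as a small generator. It yields only the prefix-plus-date stem; the trailing volume/issue suffixes vary per collection, so this is an illustration rather than the project's enumerator.

```python
def enumerate_monthly_stems(prefix: str, start_year: int, end_year: int):
    """Yield date-based identifier stems for a monthly periodical.

    Volume/issue suffixes (the trailing _1_1 parts) differ by
    collection, so a real enumerator appends them separately.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield f"{prefix}_{year:04d}-{month:02d}-01"
```

Each stem can then be handed straight to the bulk metadata endpoint to check whether the item exists.
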

Production Pipeline: Catalog → Filter → Download

Selectivity and discipline, not just downloading everything

1. Discovery & Catalog

2.3M

Items cataloged from Internet Archive

2. Quality Filtering

🔍

Applied quality criteria: completeness, OCR quality, metadata

3. Final Download

~1.4M

High-quality documents selected for processing

Multi-day stable downloads
Zero HTTP 429 errors
No human intervention needed
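The quality-filtering step in the pipeline above can be sketched as a simple gate over each item's metadata. The field names and the confidence threshold here are assumptions for illustration; the deck names only the criteria (completeness, OCR quality, metadata).

```python
def passes_quality_filter(item: dict,
                          min_ocr_confidence: float = 0.85) -> bool:
    """Illustrative quality gate: field names and the 0.85
    threshold are assumptions, not the project's criteria."""
    # Completeness: core metadata must be present and non-empty.
    for field in ("identifier", "date", "title"):
        if not item.get(field):
            return False
    # OCR quality: reject items whose confidence is known and low.
    ocr = item.get("ocr_confidence")
    if ocr is not None and ocr < min_ocr_confidence:
        return False
    return True
```

A gate like this is how 2.3M cataloged items narrow to the ~1.4M actually downloaded.
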
HOW WE WORKED TOGETHER

Human-AI Partnership on Real Engineering

robotdad's Role

  • Strategic decisions: "Delete the official library"
  • Direction: "Try pattern enumeration instead"
  • Debugging intuition: Knowing what to question next
  • Frustration that drove breakthroughs: Questioning assumptions when stuck

Amplifier's Role

  • Implementation: Writing the custom rate limiter code
  • Discovery: Finding bulk export API documentation
  • Testing: Rapidly trying different approaches
  • Execution: Building the production pipeline

Neither alone would have solved this as efficiently.
The collaboration was the breakthrough.

WHAT AMPLIFIER CAN DO

Real Production Engineering Through Conversation

🔧 Build Custom Infrastructure

When libraries don't cut it, build from scratch. Custom rate limiters, adaptive backoff, production error handling.

🔍 Research & Discover

Find obscure API documentation, dig through GitHub issues, surface technical solutions you didn't know existed.

⚡ Rapid Iteration

Try dozens of approaches in hours. Test theories immediately. Get from "it doesn't work" to "production-stable" fast.

🏗️ Architectural Pivots

When the approach isn't working, redesign. Pattern enumeration, bulk APIs, hybrid strategies—whatever it takes.

📊 Production Scale

Not toy projects. 2.3M items. Multi-day pipelines. Real error handling. Actual production systems.

🤝 True Collaboration

You make the calls, Amplifier executes. Debugging marathons together. Your intuition + AI implementation.

WHAT'S NEXT

The Story Continues

📄 Part 2: OCR at Scale

Next story coming soon:

  • Rust layer for 46× speedup
  • 143 million text corrections
  • Processing 2.3M historical documents
  • Another human-AI engineering collaboration

🧠 Part 3: Training the Timecapsule LLM

The ultimate goal:

  • Training on pre-WWI data only
  • An AI that doesn't know about modern tech
  • 19th century language and perspective
  • Full training story when complete

This download infrastructure was just the beginning.

SOURCES

Research Methodology

Data as of: February 2026

Feature status: Active — project is open source on GitHub

Research performed:

Gaps: Exact commit counts and PR history not researched for this deck. Document counts (~2.3M cataloged, ~1.4M selected) are from the project's own reporting.

Primary contributors: robotdad (Brian Krabach)

EXPLORE & TRY

See the Code

The entire project is open source

timecapsule-data repository

github.com/robotdad/timecapsule-data

✅ Custom rate limiter with adaptive backoff
✅ Bulk export API integration
✅ Pattern enumeration strategies
✅ Production-tested error handling

Curious what you can build with Amplifier?
This was built through conversational development sessions.

More Amplifier Stories