How a rigorous benchmarking infrastructure gave us the confidence to explain why Amplifier Foundation leads across 58 benchmark tasks.
Tasks where Amplifier Foundation achieved flawless execution.
Strong across domains — from scientific paper analysis to code comprehension to creative audio generation.
Blind LLM-judged comparison across structured data, documents, and presentations.
Each task was judged 7 times by a blind LLM grader. The rankings were strongly consistent across rounds (Kendall's W = 0.67).
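For reference, Kendall's W can be computed directly from the per-round rankings. Here is a minimal sketch (not the grader's actual code), assuming each round produces a complete ranking with no ties; the toy data below is illustrative only:

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's W for an (m rounds x n items) matrix of rankings, no ties."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                    # total rank each item received
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # spread of those totals
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy data, not the harness's rankings: 3 judging rounds ranking 4 outputs (1 = best).
rounds = np.array([
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
])
print(f"W = {kendalls_w(rounds):.2f}")
```

W runs from 0 (no agreement across rounds) to 1 (identical rankings in every round).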
Strong results only matter if you can trust the harness that produced them. We invested heavily in making ours reliable, fast, and fair.
Five PRs. One vision: a benchmarking harness rigorous enough to trust the results it produces.
Purpose-built primitives for reliable, parallelized agent evaluation.
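To illustrate what such a primitive can look like, here is a minimal sketch of bounded-parallel trial execution; the task names, concurrency limit, and scoring stub are placeholders, not the harness's actual API:

```python
import asyncio
import random

# Hypothetical trial runner: the real eval-recipes primitives differ.
async def run_trial(task_id: str, sem: asyncio.Semaphore) -> dict:
    async with sem:                            # cap how many agents run at once
        await asyncio.sleep(random.random())   # stand-in for launching and scoring an agent
        return {"task_id": task_id, "score": random.random()}

async def run_benchmark(task_ids: list[str], max_parallel: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(*(run_trial(t, sem) for t in task_ids))

results = asyncio.run(run_benchmark([f"task-{i:02d}" for i in range(20)]))
print(len(results), "trials completed")
```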
Only infrastructure errors set valid_trial=false, keeping agent scores clean.
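A sketch of how that separation can work; the class and exception names here are assumed for illustration rather than taken from the harness:

```python
from dataclasses import dataclass
from typing import Callable, Optional

class InfrastructureError(Exception):
    """Harness-side failure: container crash, network outage, judge API down."""

@dataclass
class TrialResult:
    task_id: str
    score: Optional[float]
    valid_trial: bool = True      # flipped to False only by infrastructure errors

def record_trial(task_id: str, run_agent: Callable[[], float]) -> TrialResult:
    try:
        return TrialResult(task_id, score=run_agent())
    except InfrastructureError:
        # The harness failed, not the agent: drop the trial from scoring.
        return TrialResult(task_id, score=None, valid_trial=False)
    except Exception:
        # The agent failed on a healthy harness: the trial stays valid and scores zero.
        return TrialResult(task_id, score=0.0)
```

Invalid trials are excluded from aggregation, so harness flakiness never drags an agent's score down.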
analysis_runner.py runs inside a Docker container.
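A minimal sketch of what a containerized invocation could look like; the image name, mount path, and entry point are assumptions, not the harness's actual setup:

```python
import subprocess

# Hypothetical invocation: run the analysis step in an isolated container so it
# cannot touch the host harness. Image and mount paths are placeholders.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", "/tmp/trial-workspace:/workspace",   # assumed trial directory
        "python:3.12-slim",
        "python", "/workspace/analysis_runner.py",
    ],
    check=True,
)
```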
245 files changed. 10,843 lines of infrastructure.
A rigorous harness that gives us confidence in the results.
Amplifier Foundation · eval-recipes v0.36