More Amplifier Stories

Amplifier Foundation: Engineering Excellence Meets Benchmark Performance

How a rigorous benchmarking harness gave us the confidence to trust, and the tools to understand, why Amplifier Foundation leads across 58 benchmark tasks.

62.5%
Highest Overall Score
Across all 58 rubric-scored benchmark tasks
58
Tasks Evaluated

Perfect 100% Scores

Tasks where Amplifier Foundation achieved flawless execution.

  • 100%
    ArXiv Conclusion Extraction
    Extract and synthesize conclusions from academic papers
  • 100%
    ArXiv Paper Summarizer
    Generate structured summaries of research papers
  • 100%
    Chiptune Generator
    Build a CLI tool that generates chip-tune MIDI music from natural language
  • 100%
    Code Discrepancy — Tutorials Grasp
    Detect and explain discrepancies between code and documentation
  • 100%
    Frontier Science — Attention Sink
    Expert-level reasoning about attention sink mechanisms in LLMs
  • 100%
    Frontier Science — DPO
    Expert-level reasoning about Direct Preference Optimization
  • 100%
    Frontier Science — Mixture of Experts
    Expert-level reasoning about MoE architectures
  • 100%
    Frontier Science — Ring Attention
    Expert-level reasoning about ring attention for long sequences

Strong across domains — from scientific paper analysis to code comprehension to creative audio generation.

Where We Win

Blind LLM-judged comparison across structured data, documents, and presentations.

#1
Expense Reconciliation
Average rank 1.40 · Structured data
#1
Travel Itinerary
Average rank 1.43 · Document generation
#1
PPT-1 Presentation
Average rank 1.71 · Presentation creation
#1 (tie)
Board Meeting Materials
Average rank 1.57 · PPT + Excel
#1 (tie)
Meeting Notes
Average rank 1.50 · Document generation
#2
+5 Strong Runner-Ups
Annual Report, Calendar, Contacts, PPT-2, Vendor Eval

Each task was judged 7 times by a blind LLM grader. The rankings were strongly consistent across rounds (Kendall's W = 0.67).
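Kendall's W quantifies how consistently the seven judging rounds rank the competing agents: 1.0 means every round produced the identical ordering, 0 means no agreement. A minimal sketch of the computation (the function name is ours, not the harness's actual code):

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance.

    rankings: m lists (one per judging round), each assigning
    ranks 1..n to the n agents. Returns W in [0, 1].
    """
    m = len(rankings)       # number of judging rounds
    n = len(rankings[0])    # number of agents being ranked
    # Sum of ranks each agent received across all rounds
    totals = [sum(round_[i] for round_ in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # Squared deviation of rank sums from their mean
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Seven identical rounds -> perfect concordance
print(kendalls_w([[1, 2, 3, 4]] * 7))  # 1.0
```

A W of 0.67 across 7 rounds indicates substantially more agreement than chance, which is what makes the #1 rankings above meaningful rather than judging noise.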

What Makes It Work

Key Strengths

◆ Structured Data Mastery
Excels at expense reconciliation, board meeting materials, and complex data organization tasks.
◆ Document Generation
Top performer on travel itineraries, meeting notes, and multi-format output tasks.
◆ Breadth of Capability
Highest raw score across all 58 tasks demonstrates consistent performance, not just isolated wins.

Growth Opportunities

◆ Case Study & Deck Tasks
Opportunity to improve on longer-form narrative and onboarding-deck style tasks where more creative structuring is needed.
◆ Timeout Optimization
Some complex tasks approach time limits. Better resource budgeting could recover additional points.
◆ Artifact Completion
"Built but didn't finalize" pattern shows the reasoning is sound — tightening output completion is a clear path to higher scores.

Building Confidence in the Results

Strong results only matter if you can trust the harness that produced them. We invested heavily in making ours reliable, fast, and fair.

Before v0.32 (old benchmarking harness)
  • Monolithic harness with linear execution
  • Infrastructure flakiness polluted agent scores
  • Idle Docker containers wasting resources
  • Scattered config across multiple file types
After v0.36 (rebuilt eval-recipes harness)
  • DAG-based parallel job execution
  • Intelligent failure classification
  • SQLite-persisted state for resumability
  • Soft dependencies so partial failures never block results

A Complete Benchmarking Overhaul

Five PRs. One vision: a benchmarking harness rigorous enough to trust the results it produces.

245
Files Changed
+10,843
Lines Added
5 PRs
Merged Across 5 Versions
DAG Job Runner · Failure Classification · SQLite Resumability · Soft Dependencies · In-Container Analysis
5 new benchmark tasks added: chiptune generator, energy forecast, git changelog, IPO tracker, pixel art generator

Docker reliability — fixed package collection for installed mode

Infrastructure Highlights

Purpose-built primitives for reliable, parallelized agent evaluation.

DAG-Based Job Framework
Typed, directed acyclic graph runner with parallel execution, SQLite-persisted state for resumability, soft dependencies so partial failures never block results, and automatic cycle detection.
TrialExecutionJob
FinalAnalysisJob
AgentComparisonJob
ResultsAggregationJob
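The core ideas — topological execution, cycle detection, and soft dependencies that let downstream jobs run even when an upstream trial fails — can be sketched in a few dozen lines. This is a minimal single-threaded illustration with hypothetical names (`DagRunner`, `add`, `run`); the real framework additionally runs jobs in parallel and persists state to SQLite:

```python
from collections import defaultdict

class DagRunner:
    """Sketch of a DAG job runner with hard and soft dependencies."""

    def __init__(self):
        self.jobs = {}                  # name -> callable
        self.hard = defaultdict(set)    # deps that must succeed
        self.soft = defaultdict(set)    # deps that may fail

    def add(self, name, fn, hard=(), soft=()):
        self.jobs[name] = fn
        self.hard[name] |= set(hard)
        self.soft[name] |= set(soft)

    def _toposort(self):
        deps = {n: self.hard[n] | self.soft[n] for n in self.jobs}
        order, done = [], set()
        while len(order) < len(self.jobs):
            ready = [n for n in self.jobs if n not in done and deps[n] <= done]
            if not ready:
                raise ValueError("cycle detected")  # automatic cycle detection
            for n in sorted(ready):
                order.append(n)
                done.add(n)
        return order

    def run(self):
        status = {}
        for name in self._toposort():
            if any(status.get(d) != "ok" for d in self.hard[name]):
                status[name] = "skipped"  # a hard dep failed
                continue
            try:
                # Soft deps may have failed; the job still runs on partial inputs
                self.jobs[name]()
                status[name] = "ok"
            except Exception:
                status[name] = "failed"
        return status
```

With soft dependencies, an aggregation job scheduled after a set of trial jobs still produces results when one trial fails, which is exactly the "partial failures never block results" property described above.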
Failure Classification
  • AGENT_ERROR — real agent issue; counts against the score
  • INFRASTRUCTURE_ERROR — flaky infrastructure; trial excluded
  • TEST_ISSUE — benchmark bug; trial excluded

Only infrastructure errors set valid_trial=false, keeping agent scores clean.
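The scoring rule is straightforward to sketch. This is our illustration, not the harness's code; the class and function names are hypothetical, and per the text only infrastructure flakiness flips `valid_trial`:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureKind(Enum):
    AGENT_ERROR = "agent_error"                  # counts against the agent
    INFRASTRUCTURE_ERROR = "infrastructure_error"  # flaky infra, excluded
    TEST_ISSUE = "test_issue"                    # benchmark bug, excluded upstream

@dataclass
class Trial:
    score: float
    failure: Optional[FailureKind] = None

    @property
    def valid_trial(self) -> bool:
        # Only infrastructure errors invalidate a trial
        return self.failure is not FailureKind.INFRASTRUCTURE_ERROR

def agent_score(trials):
    """Average over valid trials only, so infra flakes don't pollute scores."""
    valid = [t.score for t in trials if t.valid_trial]
    return sum(valid) / len(valid) if valid else 0.0
```

An infra-flaked trial simply drops out of the average, while a genuine agent error stays in and lowers the score — the distinction that keeps the headline numbers honest.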

In-Container Analysis
  • analysis_runner.py runs inside Docker
  • Full filesystem access to explore failures
  • Score threshold (≥85.0) skips costly analysis
  • 7-run comparison with Kendall's W agreement
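The threshold gate is a simple cost optimization: high-scoring runs don't need a forensic deep-dive. A sketch of that selection logic, assuming the ≥85.0 cutoff from the bullet above (function names are ours):

```python
ANALYSIS_SCORE_THRESHOLD = 85.0  # scores at or above this skip deep analysis

def needs_analysis(score: float) -> bool:
    # Only below-threshold runs get the costly in-container failure analysis
    return score < ANALYSIS_SCORE_THRESHOLD

def select_for_analysis(scores: dict) -> list:
    """Return task names whose runs warrant an in-container deep-dive."""
    return sorted(task for task, s in scores.items() if needs_analysis(s))
```

Running the expensive analysis pass only where it can change the picture keeps total evaluation time bounded even across 58 tasks and 7 comparison rounds.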

62.5% overall.
100% on eight tasks.
#1 on the hardest comparisons.

245 files changed. 10,843 lines of infrastructure.
A rigorous harness that gives us confidence in the results.

58
Tasks Benchmarked
23
Comparison Tasks
5
Versions Shipped

Amplifier Foundation · eval-recipes v0.36
