The amplifier-app-benchmarks Story
A complete regression suite built with Amplifier itself, powered by Microsoft's battle-tested eval-recipes framework.
| Test | What It Validates |
|---|---|
| Provider Response | AI providers (Anthropic, OpenAI, Gemini) respond correctly |
| Bash Execution | Shell commands execute and return results |
| Agent Delegation | Task tool spawns and coordinates sub-agents |
| Web Search | Search capabilities find relevant, recent results |
| Web Fetch | Content retrieval from URLs works correctly |
| Recipe Listing | Recipe system discovers and lists available recipes |
| PDF Extraction | Content is read and extracted correctly from PDF documents |
| AGENTS.md Injection | AGENTS.md files are properly loaded into context |
| Bundle Context | Bundle composition and context injection work correctly |
Prerequisites: Python 3.11+, Docker, and API keys for your provider(s)
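The prerequisites above can be checked before a run with a small preflight script. This is a minimal sketch and not part of the benchmark suite. The function name `check_prerequisites` is hypothetical, and the environment variable names are an assumption based on common provider conventions, since the source only says "API keys for your provider(s)".

```python
import os
import shutil
import sys

# Assumed variable names: the source does not say which ones the suite reads.
PROVIDER_KEYS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Gemini": "GEMINI_API_KEY",
}


def check_prerequisites(env=None, version=None, which=shutil.which):
    """Return a list of problems; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []

    # Python 3.11+ is the stated minimum.
    if version < (3, 11):
        problems.append(f"Python 3.11+ required, found {version[0]}.{version[1]}")

    # Only checks that a docker executable is on PATH, not that the daemon runs.
    if which("docker") is None:
        problems.append("docker executable not found on PATH")

    # At least one provider key must be set and non-empty.
    if not any(env.get(name) for name in PROVIDER_KEYS.values()):
        expected = ", ".join(PROVIDER_KEYS.values())
        problems.append(f"no provider API key set (expected one of: {expected})")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

The `env`, `version`, and `which` parameters exist only so the check can be exercised without touching the real environment. A passing check does not guarantee that Docker is running or that a key is valid.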
Data as of: February 20, 2026
Feature status: Active
Research performed:
gh repo view DavidKoleczek/amplifier-app-benchmarks: confirmed active, last updated 2026-02-10
Gaps: No local clone to verify exact test count, line counts, or license. "Hours" development time is qualitative, not measured. The MIT license claim comes from the repo and is not independently verified.
Primary contributors: David Koleczek (repo owner)
Run the regression suite or extend it with your own tests.