eval-recipes

Comparison benchmarking, streamlined interfaces,
and enhanced debugging capabilities

v0.0.28
v0.0.29
v0.0.31
Active

Three Releases,
Major Progress

v0.0.28
Comparison
Side-by-side agent benchmarking
v0.0.29
Interfaces
Simplified APIs & docs
v0.0.31
Debugging
Log context for semantic tests

Key themes: New comparison capability • Improved developer experience • Code quality improvements • Enhanced debugging

v0.0.28

Comparison-Based
Benchmarking

Evaluate multiple AI agents on the same tasks and rank them relative to each other using an LLM judge.

Harness
Parallel Execution
Each agent completes tasks independently in isolated Docker containers
Evaluation
Blind Comparison
Judge agent evaluates anonymized outputs, ranking agents from best to worst
Reporting
Rich HTML Reports
Interactive dashboards with rankings, win rates, and Kendall's W agreement scores
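The blind-comparison stage above depends on the judge never learning which agent produced which output. A minimal sketch of that anonymization step is below; the function name and shapes are assumptions for illustration, not the eval-recipes implementation.

```python
import random
from typing import Dict, Optional, Tuple

def anonymize_outputs(
    outputs: Dict[str, str], seed: Optional[int] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle agent outputs and relabel them A, B, C, ... so the
    judge cannot tell which agent produced which result.

    Returns the anonymized outputs plus a label->agent key that is
    kept away from the judge and used only when de-anonymizing ranks.
    """
    rng = random.Random(seed)
    names = list(outputs)
    rng.shuffle(names)  # random label assignment per comparison run
    labels = [chr(ord("A") + i) for i in range(len(names))]
    anonymized = {label: outputs[name] for label, name in zip(labels, names)}
    key = {label: name for label, name in zip(labels, names)}
    return anonymized, key
```

Reshuffling on every trial also prevents the judge from developing a positional bias toward whichever label tends to appear first.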
PR #36 28 files changed, +3,479 lines
v0.0.28

How Comparison Benchmarking Works

01
Trial Execution
Agents complete tasks
02
Blind Comparison
Judge evaluates anonymized outputs
03
Aggregation
Multi-trial for consistency
Avg Rank
Mean position across runs (lower is better)
Win Rate
% of runs ranked #1
Task Wins
Tasks with best avg rank
Kendall's W
Inter-rater agreement (0-1)

Multi-trial consistency: each comparison runs multiple times to reduce variance. Kendall's W >= 0.33 indicates meaningful inter-rater agreement; lower values suggest the agents' outputs are too similar to rank reliably.
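The aggregation metrics above (Avg Rank, Win Rate, Kendall's W) can be sketched from per-trial rankings. This is an illustrative implementation of the standard formulas, not the code shipped in eval-recipes; the function name and input shape are assumptions.

```python
from typing import Dict, List

def aggregate_rankings(trials: List[List[str]]) -> Dict[str, object]:
    """Aggregate per-trial rankings into summary metrics.

    trials: each inner list orders agent names from rank 1 (best)
    down to rank n (worst) for one comparison run.
    """
    agents = trials[0]
    m, n = len(trials), len(agents)  # m trials, n agents
    # Rank of each agent in each trial (1 = best)
    ranks = {a: [t.index(a) + 1 for t in trials] for a in agents}
    avg_rank = {a: sum(r) / m for a, r in ranks.items()}   # lower is better
    win_rate = {a: sum(r == 1 for r in ranks[a]) / m for a in agents}
    # Kendall's W = 12*S / (m^2 * (n^3 - n)), where S is the squared
    # deviation of each agent's rank sum from the mean rank sum.
    rank_sums = [sum(ranks[a]) for a in agents]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    return {"avg_rank": avg_rank, "win_rate": win_rate, "kendalls_w": w}
```

If every trial produces the same ordering, W is exactly 1.0; as judges disagree, the rank sums flatten toward their mean and W falls toward 0.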

v0.0.29

Streamlined
Interfaces

Major documentation overhaul and simplified harness interface for better developer experience.

Documentation
BENCHMARKING.md Overhaul
Complete rewrite with 474 lines of updated guidance, examples, and best practices for running benchmarks.
BENCHMARKING.md prepare-release.md AGENTS.md
Simplified
Harness Interface Refactor
320 lines refactored for a cleaner, more intuitive API. Easier to configure and extend.
harness.py score-default.yaml comparison-default.yaml
PR #39 18 files changed, +1,563 / -1,199 lines
v0.0.29

Less Code,
More Value

Sometimes the best code is the code you delete.

Removed
filters.py
  • 150 lines of filtering logic
  • Unused task filtering functions
  • Complex conditional branches
  • Dead code maintenance burden
Result
Cleaner Codebase
  • ✓ Reduced cognitive load
  • ✓ Fewer files to navigate
  • ✓ No unused dependencies
  • ✓ Simpler test surface

Net change: -364 lines across v0.0.29, showing that refactoring toward simplicity delivers real value.

v0.0.31

Semantic Test
Log Context

Pass agent execution logs directly into semantic tests for richer evaluation and debugging.

Capture
Agent logs are captured during trial execution and stored in structured format
Pass Through
Logs are automatically included in semantic test context via updated schemas
Analyze
LLM evaluator has full context to understand agent behavior and failures
# Agent logs are now available in semantic tests
class SemanticTestContext:
    task_description: str
    expected_outcome: str
    actual_result: str
    agent_logs: Optional[List[LogEntry]]  # NEW!

# Debug failures with full visibility into agent's execution
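A small self-contained sketch of how a semantic test might receive log context. The `LogEntry` fields shown here are assumptions for illustration; the deck only names the type, and the real schema lives in the eval-recipes repo.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogEntry:  # hypothetical shape; not the actual eval-recipes schema
    timestamp: str
    level: str
    message: str

@dataclass
class SemanticTestContext:
    task_description: str
    expected_outcome: str
    actual_result: str
    agent_logs: Optional[List[LogEntry]] = None  # NEW in v0.0.31

# Logs captured during trial execution flow straight into the context,
# so the LLM evaluator can cite the exact log line behind a failure.
ctx = SemanticTestContext(
    task_description="Create a three-slide deck",
    expected_outcome="Deck saved as out.pptx",
    actual_result="FileNotFoundError in step 2",
    agent_logs=[LogEntry("12:00:01", "ERROR", "template.pptx not found")],
)
```

Without `agent_logs`, the evaluator sees only the final result; with it, a vague "task failed" becomes traceable to the specific missing file.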
PR #44 10 files changed, +344 lines

By the Numbers

3
Releases
6
PRs Merged
56+
Files Changed
5K+
Lines Added
Features Shipped
Comparison Benchmarking v0.0.28
Simplified Harness Interface v0.0.29
Semantic Test Log Context v0.0.31
Improvements
Documentation Overhaul v0.0.29
Dead Code Removal v0.0.29

Research Methodology

Data as of: February 20, 2026

Feature status: Active

Research performed:

  • Local search: find ~/dev/ANext -maxdepth 2 -name "*eval*" - no local eval-recipes repo found
  • PR and release data sourced from existing deck content (PR #36, #39, #44)
  • Line counts and file changes from PR metadata in existing deck

Gaps: eval-recipes repo not available locally for independent verification. PR numbers and line counts carried forward from prior deck version. Exact commit dates not independently verified.

Repository: microsoft/eval-recipes (GitHub)

Primary contributors: Not independently verified from local data

Try It Out

All features are available now in v0.0.31

# Install the latest version
uv add "eval-recipes @ git+https://github.com/microsoft/eval-recipes@v0.0.31"

# Run comparison benchmarks
uv run eval_recipes.benchmark compare \
  --agents agent_a agent_b \
  --tasks ppt-1 ppt-2 ppt-3

# Run with log context for debugging
uv run eval_recipes.benchmark run \
  --task my_task \
  --with-log-context
Documentation
Updated BENCHMARKING.md guide
Examples
PowerPoint tasks (ppt-1, ppt-2, ppt-3)
Support
AGENTS.md for contribution guidelines
View on GitHub →
More Amplifier Stories