Release Progress
eval-recipes
Comparison benchmarking, streamlined interfaces,
and enhanced debugging capabilities
v0.0.28
→
v0.0.29
→
v0.0.31
Active
Overview
Three Releases,
Major Progress
v0.0.28
Comparison
Side-by-side agent benchmarking
v0.0.29
Interfaces
Simplified APIs & docs
v0.0.31
Debugging
Log context for semantic tests
Key themes: New comparison capability • Improved developer experience •
Code quality improvements • Enhanced debugging
v0.0.28
Comparison-Based
Benchmarking
Evaluate multiple AI agents on the same tasks and rank them relative to each other using an LLM judge.
Harness
Parallel Execution
Each agent completes tasks independently in isolated Docker containers
Evaluation
Blind Comparison
Judge agent evaluates anonymized outputs, ranking agents from best to worst
Reporting
Rich HTML Reports
Interactive dashboards with rankings, win rates, and Kendall's W agreement scores
PR #36
28 files changed, +3,479 lines
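The blind-comparison step above can be illustrated with a small sketch: shuffle the agents' outputs and relabel them before the judge sees anything, keeping a mapping so rankings can be de-anonymized afterward. This is a simplified illustration, not the harness's actual code; the `anonymize` helper and the "Candidate N" labels are assumptions.

```python
import random

def anonymize(outputs):
    """Shuffle agent outputs and relabel them so the judge cannot tell
    which agent produced which result (illustrative sketch only).

    outputs: dict mapping agent name -> output text.
    Returns (labeled outputs for the judge, label -> agent mapping).
    """
    agents = list(outputs)
    random.shuffle(agents)
    labels = [f"Candidate {i + 1}" for i in range(len(agents))]
    mapping = dict(zip(labels, agents))
    judged = {label: outputs[agent] for label, agent in mapping.items()}
    return judged, mapping
```

After the judge ranks the anonymous candidates, the mapping is used to translate the ranking back to real agent names for reporting.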
v0.0.28
How Comparison Benchmarking Works
01
Trial Execution
Agents complete tasks
→
02
Blind Comparison
Judge evaluates anonymized outputs
→
03
Aggregation
Multi-trial for consistency
Avg Rank
Mean position across runs (lower is better)
Win Rate
% of runs ranked #1
Task Wins
Tasks with best avg rank
Kendall's W
Inter-rater agreement (0-1)
Multi-trial consistency: Each comparison runs multiple times to reduce variance.
Kendall's W ≥ 0.33 indicates significant agreement across trials; lower values suggest the agents' outputs are too similar to rank reliably.
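The aggregation metrics above (average rank, win rate, Kendall's W) can be sketched in a few lines of Python. This is an illustrative computation under an assumed data shape (one rank dict per trial), not the shipped reporting code:

```python
from statistics import mean

def summarize(trials):
    """Aggregate comparison trials into per-agent metrics.

    trials: one dict per trial mapping agent name -> rank (1 = best).
    Returns (avg_rank, win_rate, kendalls_w).
    """
    agents = sorted(trials[0])
    avg_rank = {a: mean(t[a] for t in trials) for a in agents}
    win_rate = {a: sum(t[a] == 1 for t in trials) / len(trials) for a in agents}

    # Kendall's W: agreement among m trials ranking n agents.
    # W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    # deviations of each agent's rank sum from the mean rank sum.
    m, n = len(trials), len(agents)
    rank_sums = [sum(t[a] for t in trials) for a in agents]
    mean_sum = mean(rank_sums)
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    kendalls_w = 12 * s / (m ** 2 * (n ** 3 - n))
    return avg_rank, win_rate, kendalls_w
```

Perfect agreement across trials yields W = 1.0; near-random rankings, which typically arise when agent outputs look alike, push W toward 0.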
v0.0.29
Streamlined
Interfaces
Major documentation overhaul and simplified harness interface for better developer experience.
Documentation
BENCHMARKING.md Overhaul
Complete rewrite with 474 lines of updated guidance, examples, and best practices for running benchmarks.
BENCHMARKING.md
prepare-release.md
AGENTS.md
Simplified
Harness Interface Refactor
320 lines refactored for a cleaner, more intuitive API. Easier to configure and extend.
harness.py
score-default.yaml
comparison-default.yaml
PR #39
18 files changed, +1,563 / -1,199 lines
v0.0.29
Less Code,
More Value
Sometimes the best code is the code you delete.
Removed
filters.py
- 150 lines of filtering logic
- Unused task filtering functions
- Complex conditional branches
- Dead code maintenance burden
Result
Cleaner Codebase
- ✓ Reduced cognitive load
- ✓ Fewer files to navigate
- ✓ No unused dependencies
- ✓ Simpler test surface
Net change: -364 lines across v0.0.29, evidence that refactoring toward simplicity delivers real value.
v0.0.31
Semantic Test
Log Context
Pass agent execution logs directly into semantic tests for richer evaluation and debugging.
Capture
Agent logs are captured during trial execution and stored in structured format
Pass Through
Logs are automatically included in semantic test context via updated schemas
Analyze
LLM evaluator has full context to understand agent behavior and failures
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SemanticTestContext:
    task_description: str
    expected_outcome: str
    actual_result: str
    # LogEntry: structured log record captured during trial execution
    agent_logs: Optional[List[LogEntry]] = None
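One way the captured logs can reach the LLM evaluator is by flattening them into the prompt text. The helper below is purely illustrative; the `render_log_context` function and its field names are assumptions, not part of the eval-recipes API:

```python
def render_log_context(entries, limit=50):
    """Flatten captured agent log entries into a text block that can be
    appended to a semantic-test prompt. Illustrative only: the real
    schema ships with eval-recipes; these field names are assumed.

    entries: list of dicts with 'level' and 'message' keys.
    limit: keep only the most recent entries to bound prompt size.
    """
    lines = [f"[{e['level']}] {e['message']}" for e in entries[-limit:]]
    return "Agent execution logs:\n" + "\n".join(lines)
```

Truncating to the most recent entries keeps the evaluator's context window focused on the end of the run, where failures usually surface.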
PR #44
10 files changed, +344 lines
Velocity
By the Numbers
Features Shipped
Comparison Benchmarking
v0.0.28
Simplified Harness Interface
v0.0.29
Semantic Test Log Context
v0.0.31
Improvements
Documentation Overhaul
v0.0.29
Dead Code Removal
v0.0.29
Sources
Research Methodology
Data as of: February 20, 2026
Feature status: Active
Research performed:
- Local search:
find ~/dev/ANext -maxdepth 2 -name "*eval*" (no local eval-recipes repo found)
- PR and release data sourced from existing deck content (PR #36, #39, #44)
- Line counts and file changes from PR metadata in existing deck
Gaps: eval-recipes repo not available locally for independent verification. PR numbers and line counts carried forward from prior deck version. Exact commit dates not independently verified.
Repository: microsoft/eval-recipes (GitHub)
Primary contributors: Not independently verified from local data
What's Next
Try It Out
All features are available now in v0.0.31
uv add "eval-recipes @ git+https://github.com/microsoft/eval-recipes@v0.0.31"
uv run eval_recipes.benchmark compare \
--agents agent_a agent_b \
--tasks ppt-1 ppt-2 ppt-3
uv run eval_recipes.benchmark run \
--task my_task \
--with-log-context
Documentation
Updated BENCHMARKING.md guide
Examples
PowerPoint tasks (ppt-1, ppt-2, ppt-3)
Support
AGENTS.md for contribution guidelines