eval-recipes

Comparison benchmarking, streamlined interfaces,
and enhanced debugging capabilities

v0.0.28
v0.0.29
v0.0.31
Active

Three Releases,
Major Progress

v0.0.28
Comparison
Side-by-side agent benchmarking
v0.0.29
Interfaces
Simplified APIs & docs
v0.0.31
Debugging
Log context for semantic tests

Key themes: New comparison capability • Improved developer experience • Code quality improvements • Enhanced debugging

v0.0.28

Comparison-Based
Benchmarking

Evaluate multiple AI agents on the same tasks and rank them relative to each other using an LLM judge.

Harness
Parallel Execution
Each agent completes tasks independently in isolated Docker containers
Evaluation
Blind Comparison
Judge agent evaluates anonymized outputs, ranking agents from best to worst
Reporting
Rich HTML Reports
Interactive dashboards with rankings, win rates, and Kendall's W agreement scores
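The blind-comparison stage above depends on the judge never learning which agent produced which output. A minimal sketch of that anonymization step is below; the function name and shapes are assumptions for illustration, not the eval-recipes implementation.

```python
import random
from typing import Dict, Optional, Tuple

def anonymize_outputs(
    outputs: Dict[str, str], seed: Optional[int] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle agent outputs and relabel them A, B, C, ... so the
    judge cannot tell which agent produced which result.

    Returns the anonymized outputs plus a label->agent key that is
    kept away from the judge and used only when de-anonymizing ranks.
    """
    rng = random.Random(seed)
    names = list(outputs)
    rng.shuffle(names)  # random label assignment per comparison run
    labels = [chr(ord("A") + i) for i in range(len(names))]
    anonymized = {label: outputs[name] for label, name in zip(labels, names)}
    key = {label: name for label, name in zip(labels, names)}
    return anonymized, key
```

Reshuffling on every trial also prevents the judge from developing a positional bias toward whichever label tends to appear first.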
PR #36 28 files changed, +3,479 lines
v0.0.28

How Comparison Benchmarking Works

01
Trial Execution
Agents complete tasks
02
Blind Comparison
Judge evaluates anonymized outputs
03
Aggregation
Multi-trial for consistency
Avg Rank
Mean position across runs (lower is better)
Win Rate
% of runs ranked #1
Task Wins
Tasks with best avg rank
Kendall's W
Inter-rater agreement (0-1)

Multi-trial consistency: each comparison runs multiple times to reduce variance. Kendall's W >= 0.33 indicates meaningful inter-rater agreement; lower values suggest the agents' outputs are too similar to rank reliably.
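The aggregation metrics above (Avg Rank, Win Rate, Kendall's W) can be sketched from per-trial rankings. This is an illustrative implementation of the standard formulas, not the code shipped in eval-recipes; the function name and input shape are assumptions.

```python
from typing import Dict, List

def aggregate_rankings(trials: List[List[str]]) -> Dict[str, object]:
    """Aggregate per-trial rankings into summary metrics.

    trials: each inner list orders agent names from rank 1 (best)
    down to rank n (worst) for one comparison run.
    """
    agents = trials[0]
    m, n = len(trials), len(agents)  # m trials, n agents
    # Rank of each agent in each trial (1 = best)
    ranks = {a: [t.index(a) + 1 for t in trials] for a in agents}
    avg_rank = {a: sum(r) / m for a, r in ranks.items()}   # lower is better
    win_rate = {a: sum(r == 1 for r in ranks[a]) / m for a in agents}
    # Kendall's W = 12*S / (m^2 * (n^3 - n)), where S is the squared
    # deviation of each agent's rank sum from the mean rank sum.
    rank_sums = [sum(ranks[a]) for a in agents]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    return {"avg_rank": avg_rank, "win_rate": win_rate, "kendalls_w": w}
```

If every trial produces the same ordering, W is exactly 1.0; as judges disagree, the rank sums flatten toward their mean and W falls toward 0.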

v0.0.29

Streamlined
Interfaces

Major documentation overhaul and simplified harness interface for better developer experience.

Documentation
BENCHMARKING.md Overhaul
Complete rewrite with 474 lines of updated guidance, examples, and best practices for running benchmarks.
BENCHMARKING.md prepare-release.md AGENTS.md
Simplified
Harness Interface Refactor
320 lines refactored for a cleaner, more intuitive API. Easier to configure and extend.
harness.py score-default.yaml comparison-default.yaml
PR #39 18 files changed, +1,563 / -1,199 lines
v0.0.29

Less Code,
More Value

Sometimes the best code is the code you delete.

Removed
filters.py
  • 150 lines of filtering logic
  • Unused task filtering functions
  • Complex conditional branches
  • Dead code maintenance burden
Result
Cleaner Codebase
  • ✓ Reduced cognitive load
  • ✓ Fewer files to navigate
  • ✓ No unused dependencies
  • ✓ Simpler test surface

Net change: -364 lines across v0.0.29, showing that refactoring toward simplicity delivers real value.

v0.0.31

Semantic Test
Log Context

Pass agent execution logs directly into semantic tests for richer evaluation and debugging.

Capture
Agent logs are captured during trial execution and stored in structured format
Pass Through
Logs are automatically included in semantic test context via updated schemas
Analyze
LLM evaluator has full context to understand agent behavior and failures
# Agent logs are now available in semantic tests
class SemanticTestContext:
    task_description: str
    expected_outcome: str
    actual_result: str
    agent_logs: Optional[List[LogEntry]]  # NEW!

# Debug failures with full visibility into agent's execution
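A small self-contained sketch of how a semantic test might receive log context. The `LogEntry` fields shown here are assumptions for illustration; the deck only names the type, and the real schema lives in the eval-recipes repo.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogEntry:  # hypothetical shape; not the actual eval-recipes schema
    timestamp: str
    level: str
    message: str

@dataclass
class SemanticTestContext:
    task_description: str
    expected_outcome: str
    actual_result: str
    agent_logs: Optional[List[LogEntry]] = None  # NEW in v0.0.31

# Logs captured during trial execution flow straight into the context,
# so the LLM evaluator can cite the exact log line behind a failure.
ctx = SemanticTestContext(
    task_description="Create a three-slide deck",
    expected_outcome="Deck saved as out.pptx",
    actual_result="FileNotFoundError in step 2",
    agent_logs=[LogEntry("12:00:01", "ERROR", "template.pptx not found")],
)
```

Without `agent_logs`, the evaluator sees only the final result; with it, a vague "task failed" becomes traceable to the specific missing file.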
PR #44 10 files changed, +344 lines

By the Numbers

3
Releases
6
PRs Merged
56+
Files Changed
5K+
Lines Added
Features Shipped
Comparison Benchmarking v0.0.28
Simplified Harness Interface v0.0.29
Semantic Test Log Context v0.0.31
Improvements
Documentation Overhaul v0.0.29
Dead Code Removal v0.0.29

Research Methodology

Data as of: February 20, 2026

Feature status: Active

Research performed:

  • Local search: find ~/dev/ANext -maxdepth 2 -name "*eval*" - no local eval-recipes repo found
  • PR and release data sourced from existing deck content (PR #36, #39, #44)
  • Line counts and file changes from PR metadata in existing deck

Gaps: eval-recipes repo not available locally for independent verification. PR numbers and line counts carried forward from prior deck version. Exact commit dates not independently verified.

Repository: microsoft/eval-recipes (GitHub)

Primary contributors: Not independently verified from local data

Try It Out

All features are available now in v0.0.31

# Install the latest version
uv add "eval-recipes @ git+https://github.com/microsoft/eval-recipes@v0.0.31"

# Run comparison benchmarks
uv run eval_recipes.benchmark compare \
  --agents agent_a agent_b \
  --tasks ppt-1 ppt-2 ppt-3

# Run with log context for debugging
uv run eval_recipes.benchmark run \
  --task my_task \
  --with-log-context
Documentation
Updated BENCHMARKING.md guide
Examples
PowerPoint tasks (ppt-1, ppt-2, ppt-3)
Support
AGENTS.md for contribution guidelines
View on GitHub →
More Amplifier Stories