A Real Story

Your Teammates Already Solved Half Your Problem

You just didn't know it yet.

Team Knowledge — April 2026
The Starting Point

“I need to build an eval system that measures whether team-knowledge actually helps people work.”

Clear goal. Specific scope. Before designing anything from scratch — search what the team already has.

What Usually Happens

The default workflow is expensive

Design from scratch

You don't know what exists, so you build everything yourself. Weeks of work that might duplicate what a teammate already built.

Ask around

“Hey, does anyone have anything for eval?” — Post in Teams. Hope the right person sees it before it scrolls away.

Schedule a sync

Book time with 3–4 people to compare approaches. Wait for calendars to align. Context-switch everyone involved.

What Happened Instead

Search first. Design second.

1,300+ capabilities indexed across the team
16 teammates' work searchable
5 eval-related tools surfaced

Three targeted searches surfaced existing work from multiple teammates. The design conversation started with what already exists — not what to build from zero.

The Turning Point

“How do I build an eval system?”

“Which of my teammates' work do I compose?”

Discovery 1
MJ
sage-eval

Rubric-based LLM scoring with exactly the right interface. Takes a session, applies scoring criteria, returns structured results. Built for a different project — but the scoring engine is general-purpose.

✓ Use directly
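
To make that interface concrete, here is a minimal sketch of a rubric-based scorer of the kind described above. The names (score_session, RubricResult, the judge callable) are illustrative assumptions, not sage-eval's actual API.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class RubricResult:
        dimension: str   # e.g. "correctness"
        score: int       # e.g. 1-5
        rationale: str   # the judge's justification

    def score_session(
        transcript: str,
        rubric: dict[str, str],                        # dimension -> criterion
        judge: Callable[[str, str], tuple[int, str]],  # wraps the LLM call
    ) -> list[RubricResult]:
        """Apply each rubric criterion to a session transcript.

        `judge` takes (transcript, criterion) and returns (score, rationale);
        it stands in for whatever model call the real tool makes.
        """
        return [
            RubricResult(dim, *judge(transcript, criterion))
            for dim, criterion in rubric.items()
        ]
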
Discovery 2
Manoj
session-drift-report

A two-phase prompt pattern: gather evidence first, then score. This prevents the LLM judge from deciding a verdict and backfilling a justification. The recipe itself was the wrong scope for eval, but the prompt architecture is gold.

✓ Steal the pattern
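
The pattern is easy to sketch. Both prompts below are invented for illustration, assuming a generic llm completion function; they are not Manoj's actual prompts.

    from typing import Callable

    LLM = Callable[[str], str]  # stand-in for any text-completion call

    def gather_evidence(llm: LLM, transcript: str, criterion: str) -> str:
        # Phase 1: the judge may only collect verbatim quotes. It is not
        # allowed to score yet, so it cannot pick a verdict and backfill.
        prompt = (
            f"Criterion: {criterion}\n"
            f"Transcript:\n{transcript}\n\n"
            "List verbatim quotes relevant to this criterion. Do NOT score."
        )
        return llm(prompt)

    def score_from_evidence(llm: LLM, criterion: str, evidence: str) -> str:
        # Phase 2: the judge sees only the extracted evidence, not the full
        # transcript, and must ground its score in those quotes.
        prompt = (
            f"Criterion: {criterion}\n"
            f"Evidence:\n{evidence}\n\n"
            "Based only on this evidence, give a 1-5 score and a one-line reason."
        )
        return llm(prompt)
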
Discovery 3
David
amplifier-tester

Digital Twin Universe (DTU) — isolated containers that spin up a complete Amplifier environment for testing. The hardest infrastructure problem for eval — running truly isolated A/B sessions with no filesystem bleed-through — was already solved.

✓ Use directly
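
As a rough sketch of what isolated A/B runs look like, the snippet below uses throwaway Docker containers: --rm discards each container's filesystem after the run, so paired sessions cannot bleed into each other. The image name and command flags are placeholders, not amplifier-tester's real interface.

    import subprocess

    def run_isolated(image: str, command: list[str]) -> str:
        """Run one session in an ephemeral container and return its output."""
        result = subprocess.run(
            ["docker", "run", "--rm", image, *command],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    # Paired sessions on the same task, each in a fresh environment.
    with_kb = run_isolated("amplifier-dtu", ["run-session", "--kb", "on"])
    without_kb = run_isolated("amplifier-dtu", ["run-session", "--kb", "off"])
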
Discovery 4
David
reality-check-pipeline

Looked relevant at first glance — “reality check” sounds like evaluation. After investigation: wrong use case entirely. It validates prompt outputs against known ground truth, not session-level quality.

✗ Don't use

An honest “no” is just as valuable. It saved days that would have been wasted trying to adapt an incompatible tool.

The Result

Three teammates' work. One eval system.

MJ · Scoring engine · Grades each session on 5 dimensions
Manoj · Prompt pattern · Evidence first, prevents rationalization
David · Isolation infrastructure · Clean A/B environments via DTU

The Eval System

Run paired sessions (with KB vs. without), score each on 5 dimensions, and produce a clear verdict.

Verdicts: KB_HELPED · KB_NEUTRAL · KB_HURT
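
A sketch of how paired scores could collapse into one of those verdicts. The mean-delta rule and the 0.5 threshold are assumptions for illustration; the real system's aggregation may differ.

    def verdict(with_kb: dict[str, int], without_kb: dict[str, int],
                threshold: float = 0.5) -> str:
        """Compare per-dimension scores from the paired sessions."""
        # Assumes both dicts share the same 5 dimension keys.
        deltas = [with_kb[d] - without_kb[d] for d in with_kb]
        mean_delta = sum(deltas) / len(deltas)
        if mean_delta > threshold:
            return "KB_HELPED"
        if mean_delta < -threshold:
            return "KB_HURT"
        return "KB_NEUTRAL"
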
The Real Metric
0 meetings scheduled
0 Teams threads asking “does anyone have…”
0 interruptions to contributors

MJ, Manoj, and David weren't context-switched. Their work was discovered and composed without interrupting anyone's day.

Why This Matters

Two audiences. One insight.

Work compounds instead of being reinvented.

Three teammates' prior work was leveraged into a new design — no duplication, no wasted effort. When a team's knowledge is searchable, every investment in tooling pays dividends beyond the original project.

Your work gets found.

Build something good and your teammates will build on it — even when you don't know it's happening. You don't need to memorize everyone's repos. The knowledge base knows them for you.

The Recursive Proof

This brainstorming session proved the very thing it set out to measure.

The eval system design measures whether team knowledge changes how people work. In this session, team knowledge surfaced three teammates' work and turned a build-from-scratch design into a composition. The question was already answered before any code was written.

Sources & Methodology

How we got these numbers

Data as of: April 2026

Feature status: Active

Basis: A real brainstorming session conducted using team-knowledge search capabilities.

Data sources: the team-knowledge capability index and the brainstorming session itself.

Gaps: Exact capability count varies as teams publish new work. The 1,300+ figure is approximate, reflecting the index state during the session.

Get Started

Build on what your teammates built.

Search before you design from scratch.
The next breakthrough might already exist in a teammate's repo.

team_knowledge(operation="search", query="...")
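
The session's exact queries aren't recorded here, but the three targeted searches looked something like this (queries are illustrative):

    team_knowledge(operation="search", query="eval rubric scoring")
    team_knowledge(operation="search", query="LLM judge prompt pattern")
    team_knowledge(operation="search", query="isolated test environment containers")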

MADE team: microsoft/amplifier-bundle-team-knowledge-base
