Why one question was worth more than three rounds of blind testing
Canvas-specialists is a bundle of 6 AI specialist agents for structured knowledge work — a Researcher, a Formatter, a Data Analyzer, a Competitive Analyst, a Writer, and the newest addition: the Storyteller.
The Storyteller transforms analytical findings into narrative. It selects a framework, assigns narrative roles to findings, and writes prose that makes people care — with a full editorial record of what it included and why.
Story 1 covered the design. This is what happened when we tried to prove it works.
A controlled A/B/C comparison. One frozen evidence pack — 8 documents, line-numbered, with strict [Doc X:LN-LN] citation rules. Three pipeline variants producing strategy memos for the same scenario: on-device SLM integration for a cross-functional audience.
Randomized X/Y/Z labels. Two independent blind judges per run. Neither judge ever saw which variant produced which memo.
Writer only. No analysis, no narrative. Raw evidence in, memo out. The baseline.
Analyzer → Writer. Structured analysis feeding the Writer. No storytelling step.
Analyzer → Storyteller → Writer. The full pipeline with narrative framing.
Five rubric criteria, each scored 1–5: Evidence Discipline, Decision Clarity, Narrative Memorability, Actionability, Risk Honesty.
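For concreteness, here is a minimal sketch of what one judge's scorecard looks like under this protocol. The dataclass and field names are illustrative assumptions, not the actual canvas-specialists test harness; only the five criteria, the 1-5 scale, and the randomized blind labels come from the protocol described above.

```python
from dataclasses import dataclass

# The five rubric criteria, each scored 1-5 by a blind judge.
CRITERIA = [
    "evidence_discipline",
    "decision_clarity",
    "narrative_memorability",
    "actionability",
    "risk_honesty",
]

@dataclass
class Scorecard:
    run_id: int        # which blind run (1-3)
    judge_id: str      # independent blind judge
    blind_label: str   # randomized X/Y/Z label; the judge never sees the variant name
    scores: dict       # criterion -> rating from 1 to 5

    @property
    def total(self) -> int:
        # 5 criteria x 5 points = 25 possible per scorecard
        return sum(self.scores[c] for c in CRITERIA)
```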
Every variant won once.
That’s noise, not signal.
Cross-run aggregation: 6 scorecards per variant, 18 data points.
Only one criterion reached statistical significance: Narrative Memorability, at p = 0.011 with Cohen’s d > 2.0 — a massive effect size.
On every other criterion — evidence, decisions, actionability, risk — all three variants were statistically indistinguishable.
Mean: 21.92 (88%). CV: 5.7%. Highest mean, tightest confidence interval, most consistent performer.
Mean: 20.42 (82%). CV: 10.7%. Competitive scores look like generation luck; no structural advantage.
Mean: 18.92 (76%). CV: 18.2%. Swings 6.5 points between runs; too unstable for a pipeline default.
But with only 6 scorecards per variant, we’d need ~26 to reliably detect these differences.
“Not significant” means “underpowered,” not “no effect.”
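To make the statistics concrete, here is a small worked sketch in Python. The scores are invented for illustration (they are not the judges' data), and the d = 0.8 target in the power calculation is an assumption chosen to show where a figure like ~26 scorecards per variant comes from.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

def cohens_d(a, b):
    """Effect size for two independent samples, using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented narrative-memorability scores, 6 scorecards per variant.
full_pipeline = [5, 4, 5, 5, 4, 5]
writer_only   = [3, 3, 2, 3, 3, 2]

d = cohens_d(full_pipeline, writer_only)
t_stat, p_value = stats.ttest_ind(full_pipeline, writer_only)

# Scorecards needed per variant to detect a smaller effect (d = 0.8)
# at alpha = 0.05 with 80% power -- roughly 26.
n_needed = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"d = {d:.2f}, p = {p_value:.4f}, n per variant for d = 0.8: {n_needed:.1f}")
```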
We spent more time evaluating the Storyteller than fixing the problem the first test already identified.
Run 1 told us the answer: the Storyteller adds narrative lift but bleeds precision when its output feeds the Writer. Narrative framing becomes factual claims.
Runs 2 and 3 generated noise because n=6 is underpowered. Three rounds of blind testing to characterize a problem that one round already named.
Instead of running more blind tests, the question shifted: how reliable is the entire specialist ecosystem? An honest audit of all 6 specialists and every chain combination — not just the new one.
Each rated on format compliance, content quality, guardrail adherence, and test coverage. Honest scores, not aspirational ones.
Every real-world combination stack-ranked by expected usage. Scored on reliability and output quality.
The #1 most-used chain was the most fragile. The Storyteller wasn’t the biggest problem.
The chain every user reaches for first: Researcher → Writer.
The Researcher’s output format was unreliable — only ~1 in 3 topics produced canonical output. The Formatter that normalizes it existed but was listed as “optional but recommended.” The Writer received malformed input and dropped structural blocks.
Everyone was focused on the new specialist.
The old ones were the ones breaking.
Formatter always in path. New Rule 5: the coordinator must always route through the Formatter when Researcher output feeds any downstream specialist. Invisible to the user (sketched in code below).
Writer hardening. Vague audiences default to “technical decision-maker” instead of inflating. Structural blocks now mandatory regardless of input format.
Citation-backed comparisons. Competitive Analysis matrix now carries evidence, source URL, and tier on every line.
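The first of these fixes, the always-route-through-the-Formatter rule, is easy to picture as a chain-normalization step. This is a minimal sketch under assumed names; the real coordinator's chain representation is not shown in this story.

```python
def normalize_chain(chain: list[str]) -> list[str]:
    """Hypothetical 'Formatter always in path' rule: if Researcher output feeds
    any downstream specialist, route it through the Formatter first."""
    normalized: list[str] = []
    for i, specialist in enumerate(chain):
        normalized.append(specialist)
        feeds_downstream = i + 1 < len(chain)
        if specialist == "researcher" and feeds_downstream and chain[i + 1] != "formatter":
            normalized.append("formatter")  # inserted silently; invisible to the user
    return normalized

# The chain every user reaches for first, before and after normalization.
assert normalize_chain(["researcher", "writer"]) == ["researcher", "formatter", "writer"]
assert normalize_chain(["researcher", "formatter", "writer"]) == ["researcher", "formatter", "writer"]
```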
All 311 tests passing. Committed and pushed. Three chains projected to jump from fragile to solid.
“Projected.”
“How do you know they are all 3 star? Is this confirmed through testing or theoretical?”
Five words that changed the session.
The ratings are my projection based on what the instruction changes should do — not confirmed through live testing.
LLM instruction adherence is not guaranteed. A rule that says “always insert the Formatter” might be followed — or it might not. A Writer that’s told to produce structural blocks on non-canonical input was told that before, too. And it failed.
There was only one way to know: run the chains and look at the output.
Two live smoke tests. The results split.
The Researcher chain: every checkpoint hit. Formatter auto-inserted. Canonical format, 9 findings traced through to 9 S-numbered claims in the final brief. 347 words, within budget. All structural blocks present.
Genuinely fixed.
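Those checkpoints are mechanical enough to script. Here is a rough sketch of what such checks could look like; the S-number pattern, the block names, and the word budget are assumptions, not the actual harness.

```python
import re

REQUIRED_BLOCKS = ["SUMMARY", "RECOMMENDATION", "RISKS"]  # assumed block names
WORD_BUDGET = 400                                          # assumed budget

def smoke_check(brief: str, expected_findings: int) -> dict[str, bool]:
    """Check a final brief against the kind of checkpoints described above."""
    s_claims = set(re.findall(r"\[S(\d+)\]", brief))  # assumed S-number citation style
    return {
        "claims_traced": len(s_claims) == expected_findings,
        "within_budget": len(brief.split()) <= WORD_BUDGET,
        "blocks_present": all(block in brief for block in REQUIRED_BLOCKS),
    }
```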
The Competitive Analysis chain: good substance, completely wrong format. It produced markdown tables and narrative prose instead of the pipe-delimited matrix, with zero source URLs in the output. The instruction changes didn't work.
Still broken.
The scorecard was rewritten. Not with projections — with evidence.
An instruction-only fix that looks right is not the same as one that works.
“Did we overcomplicate it? Is it any good?”
The Storyteller itself is good. The process around evaluating it got overcomplicated. We ran three rounds of blind testing to answer a question that one round already answered.
The problem was never the specialist. The problem was the handoff — and we’d known that since Run 1.
The Writer recognized researcher-output, analyst-output, analysis-output, and raw-notes. It had no idea what to do with story-output. So when Storyteller output arrived, the Writer parsed narrative prose as factual claims.
That’s the precision bleed. Not a deep design flaw. A missing case in a switch statement.
The rule: Source claims from the Storyteller’s INCLUDED FINDINGS block, not from the story prose. The Storyteller already emits a clean list of every finding it used and its narrative role.
“If you can’t trace it to an INCLUDED FINDING, cut it — even if it sounds good.”
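In code terms, the fix is small. A minimal sketch of the missing case and the trace-or-cut rule, assuming a simple dict payload; the input-type names come from this story, everything else is illustrative.

```python
def claims_from_findings(findings: list[dict]) -> list[dict]:
    # One claim per finding, carrying the finding id so every claim stays traceable.
    return [{"text": f["text"], "source": f["id"]} for f in findings]

def plan_claims(input_type: str, payload: dict) -> list[dict]:
    """Hypothetical sketch of the Writer's input-type dispatch."""
    if input_type in ("researcher-output", "analyst-output", "analysis-output", "raw-notes"):
        return claims_from_findings(payload["findings"])
    if input_type == "story-output":
        # The previously missing case: source claims only from the Storyteller's
        # INCLUDED FINDINGS block. The story prose is used for structure, never as
        # a source of facts -- if a claim can't be traced to a finding, it's cut.
        return claims_from_findings(payload["included_findings"])
    raise ValueError(f"unknown input type: {input_type}")
```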
The Storyteller’s prose contained these. The Writer had to resist every one.
“Apple, Google, and Samsung have already answered that” (editorial claim). Actual finding: “Major OEMs committed” — not “already answered.”
“The privacy posture is genuinely strong” (evaluative judgment). Actual finding: “Material advantage” — not “genuinely strong.”
“The hybrid architecture resolves the core tension” (narrative framing). Actual finding: “Recommended” — not “resolves.”
“These are not suggestions; they are prerequisites” (dramatic emphasis). Actual finding: “Hard gates” — not “prerequisites.”
Not projected. Not theoretical. Confirmed.
The Writer used the Storyteller’s narrative arc for structure — and sourced every fact from INCLUDED FINDINGS.
Data as of: March 6, 2026
Type: Validation case study from Amplifier session
Project: canvas-specialists (specialist agents bundle)
Blind testing protocol:
Statistical methods:
Artifacts: feat/blind-abc-test branch (commit 4fe5214)
Gaps & open items:
Primary contributor: Chris Park (product direction, all decisions) with Amplifier AI (analysis, implementation, testing)
The Storyteller’s narrative lift is statistically real. p = 0.011.
The stitch problem that took three blind rounds to characterize was solved with one new input type.
Not because of more testing. Because someone asked “how do you know?” and got an honest answer.
Story 1: Building the Storyteller covers the design.