Why one question was worth more than three rounds of blind testing
Canvas-specialists is a bundle of 6 AI specialist agents for structured knowledge work — a Researcher, a Formatter, a Data Analyzer, a Competitive Analyst, a Writer, and the newest addition: the Storyteller.
The Storyteller transforms analytical findings into narrative. It selects a framework, assigns narrative roles to findings, and writes prose that makes people care — with a full editorial record of what it included and why.
Story 1 covered the design. This is what happened when we tried to prove it works.
A controlled A/B/C comparison. One frozen evidence pack — 8 documents, line-numbered, with strict [Doc X:LN-LN] citation rules. Three pipeline variants producing strategy memos for the same scenario: on-device SLM integration for a cross-functional audience.
Randomized X/Y/Z labels. Two independent blind judges per run. Neither judge ever saw which variant produced which memo.
Writer only. No analysis, no narrative. Raw evidence in, memo out. The baseline.
Analyzer → Writer. Structured analysis feeding the Writer. No storytelling step.
Analyzer → Storyteller → Writer. The full pipeline with narrative framing.
Five rubric criteria, each scored 1–5: Evidence Discipline, Decision Clarity, Narrative Memorability, Actionability, Risk Honesty.
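For concreteness, here is a minimal sketch of what one judge's scorecard looks like under this protocol. The dataclass and field names are illustrative assumptions, not the actual canvas-specialists test harness; only the five criteria, the 1-5 scale, and the randomized blind labels come from the protocol described above.

```python
from dataclasses import dataclass

# The five rubric criteria, each scored 1-5 by a blind judge.
CRITERIA = [
    "evidence_discipline",
    "decision_clarity",
    "narrative_memorability",
    "actionability",
    "risk_honesty",
]

@dataclass
class Scorecard:
    run_id: int        # which blind run (1-3)
    judge_id: str      # independent blind judge
    blind_label: str   # randomized X/Y/Z label; the judge never sees the variant name
    scores: dict       # criterion -> rating from 1 to 5

    @property
    def total(self) -> int:
        # 5 criteria x 5 points = 25 possible per scorecard
        return sum(self.scores[c] for c in CRITERIA)
```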
Every variant won once.
That’s noise, not signal.
Cross-run aggregation: 6 scorecards per variant, 18 data points.
Only one criterion reached statistical significance: Narrative Memorability, at p = 0.011 with Cohen’s d > 2.0 — a massive effect size.
On every other criterion — evidence, decisions, actionability, risk — all three variants were statistically indistinguishable.
Mean: 21.92 (88%). CV: 5.7%. Highest mean, tightest confidence interval, most consistent performer.
Mean: 20.42 (82%). CV: 10.7%. Competitive scores look like generation luck; no structural advantage.
Mean: 18.92 (76%). CV: 18.2%. Swings 6.5 points between runs; too unstable for a pipeline default.
But with only 6 scorecards per variant, we’d need ~26 to reliably detect these differences.
“Not significant” means “underpowered,” not “no effect.”
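To make the statistics concrete, here is a small worked sketch in Python. The scores are invented for illustration (they are not the judges' data), and the d = 0.8 target in the power calculation is an assumption chosen to show where a figure like ~26 scorecards per variant comes from.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

def cohens_d(a, b):
    """Effect size for two independent samples, using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented narrative-memorability scores, 6 scorecards per variant.
full_pipeline = [5, 4, 5, 5, 4, 5]
writer_only   = [3, 3, 2, 3, 3, 2]

d = cohens_d(full_pipeline, writer_only)
t_stat, p_value = stats.ttest_ind(full_pipeline, writer_only)

# Scorecards needed per variant to detect a smaller effect (d = 0.8)
# at alpha = 0.05 with 80% power -- roughly 26.
n_needed = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"d = {d:.2f}, p = {p_value:.4f}, n per variant for d = 0.8: {n_needed:.1f}")
```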
We spent more time evaluating the Storyteller than fixing the problem the first test already identified.
Run 1 told us the answer: the Storyteller adds narrative lift but bleeds precision when its output feeds the Writer. Narrative framing becomes factual claims.
Runs 2 and 3 generated noise because n=6 is underpowered. Three rounds of blind testing to characterize a problem that one round already named.
Instead of running more blind tests, the question shifted: how reliable is the entire specialist ecosystem? An honest audit of all 6 specialists and every chain combination — not just the new one.
Each rated on format compliance, content quality, guardrail adherence, and test coverage. Honest scores, not aspirational ones.
Every real-world combination stack-ranked by expected usage. Scored on reliability and output quality.
The #1 most-used chain was the most fragile. The Storyteller wasn’t the biggest problem.
The chain every user reaches for first: Researcher → Writer.
The Researcher’s output format was unreliable — only ~1 in 3 topics produced canonical output. The Formatter that normalizes it existed but was listed as “optional but recommended.” The Writer received malformed input and dropped structural blocks.
Everyone was focused on the new specialist.
The old ones were the ones breaking.
Formatter always in path. New Rule 5: the coordinator must always route through the Formatter when Researcher output feeds any downstream specialist. Invisible to the user (sketched in code below).
Writer hardening. Vague audiences default to “technical decision-maker” instead of inflating. Structural blocks now mandatory regardless of input format.
Citation-backed comparisons. Competitive Analysis matrix now carries evidence, source URL, and tier on every line.
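The first of these fixes, the always-route-through-the-Formatter rule, is easy to picture as a chain-normalization step. This is a minimal sketch under assumed names; the real coordinator's chain representation is not shown in this story.

```python
def normalize_chain(chain: list[str]) -> list[str]:
    """Hypothetical 'Formatter always in path' rule: if Researcher output feeds
    any downstream specialist, route it through the Formatter first."""
    normalized: list[str] = []
    for i, specialist in enumerate(chain):
        normalized.append(specialist)
        feeds_downstream = i + 1 < len(chain)
        if specialist == "researcher" and feeds_downstream and chain[i + 1] != "formatter":
            normalized.append("formatter")  # inserted silently; invisible to the user
    return normalized

# The chain every user reaches for first, before and after normalization.
assert normalize_chain(["researcher", "writer"]) == ["researcher", "formatter", "writer"]
assert normalize_chain(["researcher", "formatter", "writer"]) == ["researcher", "formatter", "writer"]
```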
All 311 tests passing. Committed and pushed. Three chains projected to jump from fragile to solid.
“Projected.”
“How do you know they are all 3 star? Is this confirmed through testing or theoretical?”
Five words that changed the session.
The ratings are my projection based on what the instruction changes should do — not confirmed through live testing.
LLM instruction adherence is not guaranteed. A rule that says “always insert the Formatter” might be followed — or it might not. A Writer that’s told to produce structural blocks on non-canonical input was told that before, too. And it failed.
There was only one way to know: run the chains and look at the output.
Two live smoke tests. The results split.
The Researcher chain: every checkpoint hit. Formatter auto-inserted. Canonical format, 9 findings traced through to 9 S-numbered claims in the final brief. 347 words, within budget. All structural blocks present.
Genuinely fixed.
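Those checkpoints are mechanical enough to script. Here is a rough sketch of what such checks could look like; the S-number pattern, the block names, and the word budget are assumptions, not the actual harness.

```python
import re

REQUIRED_BLOCKS = ["SUMMARY", "RECOMMENDATION", "RISKS"]  # assumed block names
WORD_BUDGET = 400                                          # assumed budget

def smoke_check(brief: str, expected_findings: int) -> dict[str, bool]:
    """Check a final brief against the kind of checkpoints described above."""
    s_claims = set(re.findall(r"\[S(\d+)\]", brief))  # assumed S-number citation style
    return {
        "claims_traced": len(s_claims) == expected_findings,
        "within_budget": len(brief.split()) <= WORD_BUDGET,
        "blocks_present": all(block in brief for block in REQUIRED_BLOCKS),
    }
```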
The Competitive Analysis chain: good substance, completely wrong format. It produced markdown tables and narrative prose instead of the pipe-delimited matrix, with zero source URLs in the output. The instruction changes didn't work.
Still broken.
The scorecard was rewritten. Not with projections — with evidence.
An instruction-only fix that looks right is not the same as one that works.
“Did we overcomplicate it? Is it any good?”
The Storyteller itself is good. The process around evaluating it got overcomplicated. We ran three rounds of blind testing to answer a question that one round already answered.
The problem was never the specialist. The problem was the handoff — and we’d known that since Run 1.
The Writer recognized researcher-output, analyst-output, analysis-output, and raw-notes. It had no idea what to do with story-output. So when Storyteller output arrived, the Writer parsed narrative prose as factual claims.
That’s the precision bleed. Not a deep design flaw. A missing case in a switch statement.
The rule: Source claims from the Storyteller’s INCLUDED FINDINGS block, not from the story prose. The Storyteller already emits a clean list of every finding it used and its narrative role.
“If you can’t trace it to an INCLUDED FINDING, cut it — even if it sounds good.”
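In code terms, the fix is small. A minimal sketch of the missing case and the trace-or-cut rule, assuming a simple dict payload; the input-type names come from this story, everything else is illustrative.

```python
def claims_from_findings(findings: list[dict]) -> list[dict]:
    # One claim per finding, carrying the finding id so every claim stays traceable.
    return [{"text": f["text"], "source": f["id"]} for f in findings]

def plan_claims(input_type: str, payload: dict) -> list[dict]:
    """Hypothetical sketch of the Writer's input-type dispatch."""
    if input_type in ("researcher-output", "analyst-output", "analysis-output", "raw-notes"):
        return claims_from_findings(payload["findings"])
    if input_type == "story-output":
        # The previously missing case: source claims only from the Storyteller's
        # INCLUDED FINDINGS block. The story prose is used for structure, never as
        # a source of facts -- if a claim can't be traced to a finding, it's cut.
        return claims_from_findings(payload["included_findings"])
    raise ValueError(f"unknown input type: {input_type}")
```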
The Storyteller’s prose contained these. The Writer had to resist every one.
“Apple, Google, and Samsung have already answered that” (editorial claim). Actual finding: “Major OEMs committed” — not “already answered.”
“The privacy posture is genuinely strong” (evaluative judgment). Actual finding: “Material advantage” — not “genuinely strong.”
“The hybrid architecture resolves the core tension” (narrative framing). Actual finding: “Recommended” — not “resolves.”
“These are not suggestions; they are prerequisites” (dramatic emphasis). Actual finding: “Hard gates” — not “prerequisites.”
Not projected. Not theoretical. Confirmed.
The Writer used the Storyteller’s narrative arc for structure — and sourced every fact from INCLUDED FINDINGS.
Data as of: March 6, 2026
Type: Validation case study from Amplifier session
Project: canvas-specialists (specialist agents bundle)
Blind testing protocol:
Statistical methods:
Artifacts: feat/blind-abc-test branch (commit 4fe5214)
Gaps & open items:
Primary contributor: Chris Park (product direction, all decisions) with Amplifier AI (analysis, implementation, testing)
The Storyteller’s narrative lift is statistically real. p = 0.011.
The stitch problem that took three blind rounds to characterize was solved with one new input type.
Not because of more testing. Because someone asked “how do you know?” and got an honest answer.
Story 1: Building the Storyteller covers the design.