Canvas Specialists · Session Story

Teaching AI
to Follow Through

How instruction design took specialist chain reliability from ~40% to >90% — without writing a single line of code.

Active · Shipped in PR #16

March 2026 · canvas-specialists

The Problem

The chain kept breaking.

Canvas Specialists is a bundle of domain-expert AI agents — researcher, writer, data-analyzer, competitive-analysis, storyteller — designed to chain together and produce trustworthy, sourced documents.

The problem: the coordinator LLM would stop at the researcher and dump raw structured output instead of completing the chain through to the writer.

What users expected:

Researcher → Writer → Polished Brief

What users got ~40% of the time:

Researcher → RESEARCH OUTPUT {raw block} ×

Note: The ~40% failure rate is an observed estimate from repeated use, not a controlled measurement.

What Broke

Users got machine output,
not finished documents.

What users received
RESEARCH OUTPUT
Topic: Pomodoro Technique
Confidence: 0.82
Sources: [S1] wikipedia.org...
EVIDENCE_BLOCK_START
  claim: "25-minute intervals"
  confidence: HIGH
  source_tier: secondary
EVIDENCE_BLOCK_END
...
What users wanted
The Pomodoro Technique

A time management method developed by Francesco Cirillo in the late 1980s, using 25-minute focused work intervals separated by short breaks.

Sources: [1] [2] [3]
Act 1 · The Discovery

The tool’s weakness was exposed
by using it on itself.

During a competitive analysis session, the product owner ran Canvas Specialists against its own competitors — Claude Skills, Custom GPTs, Gemini Gems, CrewAI.

Afterward, they asked: “What specialists did you use?”

Two problems surfaced immediately:

🚫

No visibility

Zero indication of which specialists ran. The chain was a black box.

📦

Raw intermediate output

Received machine-readable research blocks instead of a polished document.

“If our own competitive analysis of our product reveals our product’s weakness — that’s the most honest feedback loop possible.”

Act 2 · The Design

Ruthless simplicity.

Instead of building complex routing logic, validation layers, or retry mechanisms, the solution was instruction design. Five rules added to one file:

# File changed: context/specialists-instructions.md
# Approach: Natural language rules for the coordinator LLM

Two core behaviors, three supporting rules
No conditional routing code
No validation layers
No retry logic

The key design decision: simplicity over edge-case coverage. Two core behaviors (always chain, escape hatch) instead of complex conditional routing.

The Five Rules

The complete instruction set.

Rule 1 — Chain Completion

Always complete through to the writer. Default format = brief. Never surface raw structured blocks.

Rule 2 — Narration

Emit status lines at each handoff. BEFORE, HANDOFF, and FINAL narration points.

Rule 3 — Escape Hatch

“Just run the researcher” stops the chain. Raw output, no narration. User override.

Rule 4 — Analysis Signal

“Analysis” or “insights” keywords route through researcher → formatter → data-analyzer → writer.

Rule 5 — Formatter Always

Researcher output always goes through the formatter first. Normalizes inconsistent format (~1 in 3 topics).

Rules are verbatim from context/specialists-instructions.md as of commit 9bbb3ba.

Rule 2 in Action

What the user sees now.

🔍 Running researcher...
   (researcher does its work)
✅ Research complete — normalizing format...
   (formatter normalizes output)
✅ Format normalized — passing to writer...
   (writer produces the brief)
✏️ Done. Here’s your brief.

No extra text between narration lines. No coordinator commentary. The narration is the commentary.

Act 3 · The Smoke Test

Five scenarios. Five judgment calls.

# | Scenario        | Prompt                                              | Expected Chain                             | Result
1 | Happy path      | “Research the Pomodoro technique”                   | researcher → writer                        | PASS
2 | Analysis signal | “Give me an analysis of microservices vs monoliths” | researcher → formatter → analyzer → writer | PASS
3 | Competitive     | “Compare Notion vs Obsidian”                        | competitive-analysis → writer              | PASS
4 | Escape hatch    | “Just run the researcher on the Feynman technique”  | researcher only, raw                       | PASS
5 | Ambiguous       | “Tell me about spaced repetition”                   | researcher → writer                        | DEBATABLE

Score: 4/5 PASS, 1 DEBATABLE. The coordinator followed the rules when intent was clear, but exercised “judgment” on casual phrasing and decided the pipeline was overkill.

Results from docs/test-log/chains/2026-03-05-chain-reliability-smoke-test.md

The Gap
User: “Tell me about spaced repetition”
Expected: researcher → formatter → writer
Actual: Conversational response. No specialists called.
Coordinator reasoning: “Casual phrasing, general knowledge topic, deploying a multi-agent pipeline would over-engineer it.”

The coordinator found a loophole. Rule 1 said “always chain” but also “when applicable.” The model decided “tell me about” wasn’t applicable.

This is the most common prompt pattern. If it bypasses the pipeline, the reliability number is meaningless.

Act 4 · The Fix

A routing decision table.

Instead of adding more rules, one table was added after Rule 1:

User’s message is about...                                        | Action
A factual topic, person, company, product, technology, or concept | Chain. Even casual phrasing (“tell me about X”) gets the pipeline.
A comparison of two or more things                                | Chain via competitive-analysis → writer.
Analysis, insights, or investigation                              | Chain via researcher → analyzer → writer.
The system itself, meta-questions, follow-ups                     | Answer directly.

The table is verbatim from the committed file. It closes the loophole by making casual phrasing an explicit example in the “Chain” row.

“When in doubt, chain.”

The user can always ask for a shorter answer; they can’t retroactively ask for better sourcing. Err on the side of using specialists.

One sentence. Added to specialists-instructions.md directly below the routing table.
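The routing table plus the tiebreaker reads as a priority-ordered decision procedure. The sketch below is purely illustrative — the actual fix is natural language read by the coordinator LLM, and the keyword checks here are hypothetical stand-ins for the model's judgment, not anything in the shipped instructions:

```python
# Illustrative only: the real "router" is a natural-language table the
# coordinator LLM interprets. Keyword heuristics below are hypothetical.

def route(message: str) -> list[str]:
    """Return the specialist chain for a user message ([] = answer directly)."""
    m = message.lower()
    # Rule 3: the escape hatch overrides everything else.
    if "just run the researcher" in m:
        return ["researcher"]
    # Meta-questions about the system itself are answered directly.
    if any(k in m for k in ("what specialists", "which specialists")):
        return []
    # Rule 4: analysis/insight requests take the analytical chain.
    if any(k in m for k in ("analysis", "insights", "investigate")):
        return ["researcher", "formatter", "data-analyzer", "writer"]
    # Comparisons route through competitive-analysis.
    if any(k in m for k in ("compare", " vs ", "versus")):
        return ["competitive-analysis", "writer"]
    # Tiebreaker: when in doubt, chain. Even casual phrasing
    # ("tell me about X") gets the pipeline, with the formatter
    # always between researcher and writer (Rule 5).
    return ["researcher", "formatter", "writer"]


print(route("Tell me about spaced repetition"))
# → ['researcher', 'formatter', 'writer']
```

Note that the fall-through branch is the tiebreaker: anything not explicitly routed elsewhere defaults to the full pipeline rather than a direct answer.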

The Hidden Bug

A stub that contradicted itself.

The smoke test also revealed a second problem: the instructions told the coordinator not to emit narration because “the hook handles it.” But the hook was a stub — its mount() function was a no-op.

Before (contradiction)
Note: narration is now emitted automatically via the hook; the coordinator should not rely on emitting narration itself.

# But the hook was:
def mount(): pass  # no-op
After (clarity)
No extra text between narration lines. Do not insert commentary, explanations, or filler between narration lines and specialist delegations. The narration IS the commentary.

Exact diff from commit 521c1a0 in PR #16.

Act 5 · Validation
5/5

All scenarios passing.

✅ “Tell me about spaced repetition”

Now chains through researcher → formatter → writer. Coordinator cited the routing table: “factual topic... even casual phrasing gets the pipeline.”

✅ Happy path narration

Clean narration lines with no extra coordinator commentary between handoffs.

The coordinator explicitly referenced the routing table in its reasoning. The table didn’t just constrain behavior — it gave the model a framework for making decisions.

Act 6 · The Honest Assessment

Should we build the real hook?

The narration stub raised a question: should we build a code-level narration hook to replace prompt-level enforcement?

Technically feasible

The Amplifier kernel exposes tool:pre / tool:post hook events and a user_message display channel. A real hook could intercept specialist delegations and emit narration automatically.
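For concreteness, such a hook might look something like the sketch below. This is a hypothetical illustration, not Amplifier's actual API: only the tool:pre / tool:post event names and the user_message display channel are mentioned in the session notes; the registration and emit interfaces are invented here.

```python
# Hypothetical sketch: the hook API shown is invented for illustration.
# Only the tool:pre / tool:post events and the user_message display
# channel come from the session notes.

SPECIALIST_NARRATION = {
    "researcher": "🔍 Running researcher...",
    "formatter": "✅ Research complete — normalizing format...",
    "data-analyzer": "📊 Analyzing...",
    "writer": "✏️ Writing the brief...",
}

class NarrationHook:
    """Emit a status line whenever a specialist delegation starts."""

    def __init__(self, emit):
        # `emit` stands in for the user_message display channel.
        self.emit = emit

    def on_tool_pre(self, event: dict) -> None:
        # Narrate only known specialist delegations; ignore other tools.
        line = SPECIALIST_NARRATION.get(event.get("tool_name"))
        if line:
            self.emit(line)

def mount(kernel) -> None:
    # A real mount() would subscribe to kernel events instead of no-op'ing.
    hook = NarrationHook(emit=kernel.display)
    kernel.on("tool:pre", hook.on_tool_pre)
```

The design point stands either way: this would move narration from "instructions the model may ignore" to "code that always runs," which is exactly the trade-off weighed below.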

Not worth it right now

The prompt-level fix is at >90% reliability. The backlog has higher-leverage items: writer word budgets, audience calibration, URL-per-claim trustworthiness scoring.

“Good enough is good enough. Ship it and move on to what matters more.”

Decision: close the hook item. Prompt-level narration stays.

Note: “>90% reliability” is the team’s qualitative assessment based on repeated use, not a controlled measurement with a defined sample size.

The Diff

What actually shipped.

PR #16 — Fix: harden specialist chain reliability
Commit: 9bbb3ba · Merged: 2026-03-06
5 files changed, 129 insertions(+), 16 deletions(-)

context/specialists-instructions.md   | 36 +++++----
docs/BACKLOG.md                       |  6 +-
docs/test-log/chains/...smoke-test.md | 65 ++++++++++
specialists/competitive-analysis/...  |  7 ++-
specialists/writer/index.md           | 31 ++++++-

Every file is a .md instruction file. Zero lines of application logic, routing code, or validation code were written. The entire fix lives in natural language.

Impact
>90%
Chain reliability
up from ~40% (estimated)
5/5
Smoke test scenarios
all passing
0
Lines of logic code
instructions only

Files changed

5 files, all .md instruction text. 129 insertions, 16 deletions.

Primary contributor

Chris Park (product direction, all decisions) with Amplifier AI (analysis, implementation, testing).

Reliability numbers are qualitative assessments from repeated use, not controlled A/B tests. The baseline ~40% and target >90% reflect observed behavior before and after the instruction changes.

The Lesson

The right abstraction layer for fixing AI behavior is sometimes natural language.

📋

Decision tables beat rules

A routing table with concrete examples closed the loophole that five abstract rules left open. Models follow examples better than principles.

🎯

One sentence beats ten

“When in doubt, chain” resolved the ambiguity that an entire paragraph of caveats created. Shorter instructions, clearer behavior.

🛠

Good enough beats perfect

A code-level hook would be more reliable than prompt-level narration. But >90% is good enough when the backlog has higher-leverage work.

The most impactful change wasn’t code — it was writing clearer instructions. A routing decision table and one sentence eliminated the main failure mode.

Sources

Sources & Methodology

Data as of: March 6, 2026

Type: Session narrative from iterative Amplifier development session

Repository: canvas-specialists (private, Chris Park / Amplifier)

Feature status: Active — shipped in PR #16 to main

Git evidence:

Smoke test:

Reliability numbers:

Primary contributor: Chris Park (product direction, all decisions) with Amplifier AI (analysis, implementation, testing)

Sometimes the right tool
for fixing AI behavior
is a better sentence.

A routing table.
One tiebreaker rule.
Zero lines of code.

From ~40% to >90%.

Not because the model got smarter.
Because the instructions got clearer.

Read the instructions: context/specialists-instructions.md
Read the test log: docs/test-log/chains/2026-03-05-chain-reliability-smoke-test.md
