How instruction design took specialist chain reliability from ~40% to >90% — without writing a single line of code.
Active · Shipped in PR #16 · March 2026 · canvas-specialists
Canvas Specialists is a bundle of domain-expert AI agents — researcher, writer, data-analyzer, competitive-analysis, storyteller — designed to chain together and produce trustworthy, sourced documents.
The problem: the coordinator LLM would stop at the researcher and dump raw structured output instead of completing the chain through to the writer.
What users expected: a polished, sourced document delivered by the full specialist chain.
What users got ~40% of the time: raw structured researcher output, with the chain stopped before the writer.
Note: The ~40% failure rate is an observed estimate from repeated use, not a controlled measurement.
During a competitive analysis session, the product owner ran Canvas Specialists against its own competitors — Claude Skills, Custom GPTs, Gemini Gems, CrewAI.
Afterward, they asked: “What specialists did you use?”
Two problems surfaced immediately:
Zero indication of which specialists ran. The chain was a black box.
Received machine-readable research blocks instead of a polished document.
“If our own competitive analysis of our product reveals our product’s weakness — that’s the most honest feedback loop possible.”
Instead of building complex routing logic, validation layers, or retry mechanisms, the solution was instruction design. Five rules added to one file:
The key design decision: simplicity over edge-case coverage. Two core behaviors (always chain, escape hatch) instead of complex conditional routing.
1. Always complete through to the writer. Default format = brief. Never surface raw structured blocks.
2. Emit status lines at each handoff. BEFORE, HANDOFF, and FINAL narration points.
3. “Just run the researcher” stops the chain. Raw output, no narration. User override.
4. “Analysis” or “insights” keywords route through researcher → formatter → data-analyzer → writer.
5. Researcher output always goes through the formatter first. Normalizes inconsistent formatting (~1 in 3 topics).
Rules are verbatim from context/specialists-instructions.md as of commit 9bbb3ba.
No extra text between narration lines. No coordinator commentary. The narration is the commentary.
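As an illustration only, narration at the three points might read like this (the exact wording in the committed instruction file is not reproduced here; this format is hypothetical):

```
BEFORE: delegating to researcher
HANDOFF: researcher → writer
FINAL: writer produced the brief
```

The point of the rule is that these lines are the entire user-visible trace of the chain, with nothing between them.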
| # | Scenario | Prompt | Expected Chain | Result |
|---|---|---|---|---|
| 1 | Happy path | “Research the Pomodoro technique” | researcher → writer | PASS |
| 2 | Analysis signal | “Give me an analysis of microservices vs monoliths” | researcher → formatter → analyzer → writer | PASS |
| 3 | Competitive | “Compare Notion vs Obsidian” | competitive-analysis → writer | PASS |
| 4 | Escape hatch | “Just run the researcher on the Feynman technique” | researcher only, raw | PASS |
| 5 | Ambiguous | “Tell me about spaced repetition” | researcher → writer | DEBATABLE |
Score: 4/5 PASS, 1 DEBATABLE. The coordinator followed the rules when intent was clear, but exercised “judgment” on casual phrasing and decided the pipeline was overkill.
Results from docs/test-log/chains/2026-03-05-chain-reliability-smoke-test.md
The coordinator found a loophole. Rule 1 said “always chain” but also “when applicable.” The model decided “tell me about” wasn’t applicable.
This is the most common prompt pattern. If it bypasses the pipeline, the reliability number is meaningless.
Instead of adding more rules, one table was added after Rule 1:
| User’s message is about... | Action |
|---|---|
| A factual topic, person, company, product, technology, or concept | Chain. Even casual phrasing (“tell me about X”) gets the pipeline. |
| A comparison of two or more things | Chain via competitive-analysis → writer. |
| Analysis, insights, or investigation | Chain via researcher → analyzer → writer. |
| The system itself, meta-questions, follow-ups | Answer directly. |
The table is verbatim from the committed file. It closes the loophole by making casual phrasing an explicit example in the “Chain” row.
“When in doubt, chain.”
The user can always ask for a shorter answer; they can’t retroactively ask for better sourcing. Err on the side of using specialists.
One sentence. Added to specialists-instructions.md directly below the routing table.
The smoke test also revealed a second problem: the instructions told the coordinator not to emit narration because “the hook handles it.” But the hook was a stub — its mount() function was a no-op.
Exact diff from commit 521c1a0 in PR #16.
Now chains through researcher → formatter → writer. Coordinator cited the routing table: “factual topic... even casual phrasing gets the pipeline.”
Clean narration lines with no extra coordinator commentary between handoffs.
The coordinator explicitly referenced the routing table in its reasoning. The table didn’t just constrain behavior — it gave the model a framework for making decisions.
The narration stub raised a question: should we build a code-level narration hook to replace prompt-level enforcement?
The Amplifier kernel exposes tool:pre / tool:post hook events and a user_message display channel. A real hook could intercept specialist delegations and emit narration automatically.
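For context on what the closed backlog item would have entailed, here is a minimal sketch of a narration hook, assuming an event-bus-style registry for the tool:pre / tool:post events. `EventBus`, `emit_user_message`, and the payload keys are hypothetical stand-ins for illustration, not real Amplifier APIs:

```python
from collections import defaultdict


class EventBus:
    """Minimal stand-in for the kernel's hook registry (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        self._handlers[event].append(handler)

    def fire(self, event, payload):
        for handler in self._handlers[event]:
            handler(payload)


narration = []  # stand-in for the user_message display channel


def emit_user_message(text):
    narration.append(text)


def mount(bus):
    # The shipped stub left this function empty. A real hook would subscribe
    # to delegation events and emit narration automatically, instead of
    # relying on the coordinator prompt to remember to do it.
    bus.on("tool:pre",
           lambda p: emit_user_message(f"BEFORE: {p['specialist']}"))
    bus.on("tool:post",
           lambda p: emit_user_message(
               f"HANDOFF: {p['specialist']} -> {p['next']}"))


bus = EventBus()
mount(bus)
bus.fire("tool:pre", {"specialist": "researcher"})
bus.fire("tool:post", {"specialist": "researcher", "next": "writer"})
# narration now holds the two status lines
```

The trade-off is exactly the one the team weighed: code-level enforcement can never be skipped by the model, but it adds a component to build, test, and maintain.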
The prompt-level fix is at >90% reliability. The backlog has higher-leverage items: writer word budgets, audience calibration, URL-per-claim trustworthiness scoring.
“Good enough is good enough. Ship it and move on to what matters more.”
Decision: close the hook item. Prompt-level narration stays.
Note: “>90% reliability” is the team’s qualitative assessment based on repeated use, not a controlled measurement with a defined sample size.
Every file is a .md instruction file. Zero lines of application logic, routing code, or validation code were written. The entire fix lives in natural language.
5 files, all .md instruction text. 129 insertions, 16 deletions.
Chris Park (product direction, all decisions) with Amplifier AI (analysis, implementation, testing).
Reliability numbers are qualitative assessments from repeated use, not controlled A/B tests. The baseline ~40% and target >90% reflect observed behavior before and after the instruction changes.
A routing table with concrete examples closed the loophole that five abstract rules left open. Models follow examples better than principles.
“When in doubt, chain” resolved the ambiguity that an entire paragraph of caveats created. Shorter instructions, clearer behavior.
A code-level hook would be more reliable than prompt-level narration. But >90% is good enough when the backlog has higher-leverage work.
The most impactful change wasn’t code — it was writing clearer instructions. A routing decision table and one sentence eliminated the main failure mode.
Data as of: March 6, 2026
Type: Session narrative from iterative Amplifier development session
Repository: canvas-specialists (private, Chris Park / Amplifier)
Feature status: Active — shipped in PR #16 to main
Git evidence: commits 9bbb3ba and 521c1a0 in PR #16
Smoke test: docs/test-log/chains/2026-03-05-chain-reliability-smoke-test.md
Reliability numbers: qualitative assessments from repeated use, not controlled measurements
Primary contributor: Chris Park (product direction, all decisions) with Amplifier AI (analysis, implementation, testing)
Sometimes the right tool
for fixing AI behavior
is a better sentence.
A routing table.
One tiebreaker rule.
Zero lines of code.
From ~40% to >90%.
Not because the model got smarter.
Because the instructions got clearer.
Read the instructions: context/specialists-instructions.md
Read the test log: docs/test-log/chains/2026-03-05-chain-reliability-smoke-test.md