
Reality Check

When AI says "done," is it actually done?
A verification pipeline that closes the gap between
"tests pass" and "this actually works."

The Problem

AI-generated software gets verified
in the same context it was built in.

Context Poisoning

Agents claim things work because the build conversation says so. They never test from a fresh perspective.

Deployment Blind Spots

Issues slip through because agents never consider deployment details, dependencies, or real-world setup.

Tests Pass != Works

"Tests pass" and "this actually works for a real user" are two very different statements.

Manual Verification Still Required

The agent says it's done, but verification still falls on you. "Done" becomes "please check this for me."

The Vision

Five requirements for "done means done"

1. Understand user intention
   The machine figures out what to test -- no manually written test plans.
2. Build the environment itself
   No manual configuration. Deploy into an isolated Digital Twin Universe automatically.
3. Run validation autonomously
   Real browser-based testing against the deployed app -- clicks, typing, screenshots.
4. "Done" means done
   You get a working demo and evidence report, not a checklist or a claim.
5. General, not bespoke
   A standard capability any agent can use. Not tied to a specific project or framework.
The Build

From vision to working pipeline

Synthesized the evidence-based testing vision
Scaffolded bundle, agents, and architecture
Built Intent Analyzer and Report agents
Created E2E test harness with injectable bugs
Assembled the 4-step pipeline recipe
Validated pipeline catches injected bugs
Polish, cleanup, and fixture refinement
The Pipeline

Four steps from intent to evidence

1. Intent Analysis
   Reads specs, conversation history, and feedback to produce structured acceptance tests.
2. DTU Launch
   Deploys the software in an isolated Digital Twin Universe environment.
3. Browser Testing
   Drives real Chromium against the running app -- clicks, types, screenshots.
4. Report
   Produces gap analysis YAML and a self-contained HTML artifact with evidence.
Input

User spec, conversation history, the software repo -- the same artifacts that already exist from the build session. No additional setup required.

Output

Structured report.yaml, self-contained report.html with embedded screenshots and verdict, and a running DTU environment you can open in a browser and explore.

Orchestrated as a flat Amplifier recipe -- each step's output flows into the next via templated variables. Runnable as a single command or step-by-step.
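To make the flow concrete, here is a minimal sketch of what such a recipe could look like. The step and field names below are assumptions for illustration, not Amplifier's actual recipe schema.

# Hypothetical recipe sketch -- illustrative names, not the bundle's real schema.
steps:
  - id: intent_analysis
    agent: intent-analyzer
    inputs: {spec: "{{user_spec}}", conversation: "{{conversation_history}}"}
    output: acceptance_tests          # structured YAML acceptance tests
  - id: dtu_launch
    agent: dtu-profile-builder
    inputs: {software: "{{repo_path}}"}
    output: dtu_url                   # running Digital Twin Universe endpoint
  - id: browser_testing
    agent: browser-tester
    inputs: {tests: "{{intent_analysis.output}}", target: "{{dtu_launch.output}}"}
    output: test_results              # one PASS/FAIL/ERROR/SKIP per test
  - id: report
    agent: report
    inputs: {tests: "{{intent_analysis.output}}", results: "{{browser_testing.output}}"}
    output: [report.yaml, report.html]

Each step consumes the previous step's output via templated variables, matching the flat, single-command orchestration described above.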

The Agents

Three specialists, one pipeline

Intent Analyzer

"What does done mean?"

Reads specs, conversation history, and feedback to derive structured acceptance tests in YAML. Classifies software type (web app, CLI, API, library) and assigns must/should/nice priorities. Explicit requirements map to user statements; implicit requirements are inferred from context. Unknowns become documented assumptions.
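Concretely, the acceptance-test YAML might look like the following. This is a hypothetical sketch based on the description above; the field names are illustrative, not the bundle's exact schema.

# Hypothetical acceptance-test output -- illustrative fields only.
software_type: web_app              # web app, CLI, API, or library
acceptance_tests:
  - id: send-message
    priority: must                  # must / should / nice
    source: explicit                # maps to a user statement
    statement: "User can type a message, click send, and see a response"
  - id: pin-conversation
    priority: should
    source: implicit                # inferred from context
    statement: "Pinned conversations persist across page reloads"
assumptions:
  - "No authentication required; the app runs unauthenticated in the DTU"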

Browser Tester

Real browser, real evidence

Drives Chromium via agent-browser CLI using accessibility-tree refs. Navigates, fills inputs, clicks buttons, waits for network idle, and takes screenshots at every checkpoint. Polls for interactive elements instead of fixed sleeps. One explicit PASS/FAIL/ERROR/SKIP per acceptance test -- no summarizing as "likely works."
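Each acceptance test therefore gets one explicit result record. A hypothetical shape, following the illustrative YAML conventions above:

# Hypothetical per-test result -- illustrative fields only.
results:
  - test: send-message
    status: FAIL                    # exactly one of PASS / FAIL / ERROR / SKIP
    evidence: "Typed a message and clicked send; no request fired, no DOM update"
    screenshots: [send_before.png, send_after.png]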

Report

Gap analysis + self-contained HTML artifact

Fuzzy-matches validator results to acceptance tests. Produces report.yaml (structured data) and a self-contained report.html with embedded base64 screenshots, color-coded verdict banner, test-by-test results table, and gap analysis. Verdict logic: any must failure = fail, gaps remaining = partial, all clear = pass.
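As a hedged sketch, report.yaml could encode that verdict logic like this (illustrative structure, not the bundle's exact output):

# Hypothetical report.yaml -- illustrative fields only.
verdict: partial                    # must failure -> fail; open gaps -> partial; all clear -> pass
summary: {passed: 7, failed: 0, skipped: 1, total: 8}
gaps:
  - test: pin-conversation
    priority: should
    detail: "Pin state not re-verified after reload"
screenshots_embedded: 24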

Architecture

Data flow: from user intent to verification report

[Architecture diagram] Spec/requirements, the agent conversation history, and user feedback feed the Intent Analyzer agent, which produces verification criteria. The software under test (repo, directory, or artifact) feeds the DTU Profile Builder agent, which launches a Digital Twin Environment. The Browser Tester agent -- and, in the future, other validators (CLI, API, ...) -- test against that environment using the criteria and pass results to the Report agent, which produces the Reality Check Report and a user-facing verification artifact (visual, dashboard, ...).

Green = Reality Check agents  •  Blue = Digital Twin Universe  •  Yellow = Data artifacts  •  Red = Output artifacts  •  Dashed = Future/extensible

Testing Infrastructure

Built to prove itself

A verification pipeline needs to be verifiable. The bundle ships with everything needed to test itself.

Synthetic Test Fixtures

A 79-line user spec and 25-turn build conversation for Amplifier Chat -- realistic enough to stress-test intent analysis.

Covers: chat interface, session history, pinning, slash commands, health endpoints, streaming, error handling.

Injectable Bug Patches

Two intentionally subtle bugs that can be injected with a --with-bugs flag:

pin-persistence: wrong JSON key breaks save/load
send-noop: && to || silently blocks sending

E2E Playground Script

One command sets up a self-contained test directory: clones a real app at a pinned commit, copies fixtures, optionally injects bugs. Ready to run the full pipeline against.

Step-by-Step Playground Guide

Documentation for manually testing each pipeline stage in isolation -- intent analysis, DTU launch, browser testing, report generation -- for development and debugging.

Validation

The pipeline catches intentionally injected bugs

The bundle ships with an E2E playground that clones a real app and injects subtle bugs via --with-bugs to validate the full pipeline end to end.

Example: send-noop bug injection

One of the injectable patches flips a guard condition from && to ||, silently blocking all text-only messages. The browser tester catches this because it actually types a message and clicks send.

// Guard: don't send if no content AND no images
- if (!content && pendingImages.length === 0) return;
+ if (!content || pendingImages.length === 0) return;

This is exactly the kind of subtle, silent regression that passes unit tests but fails real-world use.

FAIL (pipeline detects the bug) → fix applied, DTU restarted → pipeline re-run → PASS (all tests pass)
Results

E2E validation run against Amplifier Chat

Run against a version with intentionally injected bugs via setup-e2e-playground.sh --with-bugs to validate the pipeline catches real failures.

Test | Priority | Status | Evidence
Chat page loads | must | pass | Title "Amplifier Chat", interactive in 6s
Message input and send button visible | must | pass | Input @e5, button @e3 found via snapshot
Send a message and get response | must | fail | doSend() guard blocks text-only sends
Session history sidebar | must | pass | Sidebar renders with conversation list
Pin a conversation | should | pass | Pin icon visible, toggles state
Health endpoint returns 200 | must | pass | GET /chat/health returns JSON
Streaming response renders progressively | should | pass | SSE chunks arrive, DOM updates live
Slash commands (/help, /clear) | should | pass | /help shows command list
7/8 tests passed (after fix: 8/8)  •  24 screenshots captured  •  Verdict (before fix): PARTIAL
Key Decisions

Design choices that shaped the bundle

Separate bundle, not a DTU feature (separation of concerns)

Reality Check depends on DTU but lives in its own bundle. Verification is a distinct capability -- any environment provider could back it.

"Acceptance tests" not "contracts" SIMPLICITY

The initial "contract" abstraction was too unfamiliar. Mapping to acceptance tests -- a known software engineering concept -- made the pipeline legible.

DTU left running after the pipeline (user experience)

The environment stays up so the user can explore the deployed app themselves -- a working demo, not just a report.

One result row per acceptance test (exhaustiveness)

Early runs showed the browser tester skipping criteria. Now every acceptance test must have an explicit PASS, FAIL, ERROR, or SKIP -- no summarizing as "likely works."

"Done" should mean done.

Reality Check turns "I think it works" into "here's the evidence."

A verification pipeline that autonomously validates what was built
against what was asked for.

Intent: What does done mean?
Environment: Deploy it for real
Evidence: Prove it works
amplifier-bundle-reality-check v0.1.0