When AI says "done," is it actually done?
A verification pipeline that closes the gap between
"tests pass" and "this actually works."
Agents claim things work because the build conversation says so. They never test from a fresh perspective.
Issues go unnoticed because agents never account for deployment details, dependencies, or real-world setup.
"Tests pass" and "this actually works for a real user" are two very different statements.
The agent says it's done, but verification still falls on you. "Done" becomes "please check this for me."
User spec, conversation history, the software repo -- the same artifacts that already exist from the build session. No additional setup required.
Structured report.yaml, a self-contained report.html with embedded screenshots and a verdict, and a running DTU (Digital Twin Universe) environment you can open in a browser and explore.
Orchestrated as a flat Amplifier recipe -- each step's output flows into the next via templated variables. Runnable as a single command or step-by-step.
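The flat-recipe idea is easy to sketch. The shape below is purely illustrative -- step names and the templating syntax are assumptions, not the actual Amplifier recipe schema:

```yaml
# Purely illustrative flat-recipe shape -- the real Amplifier recipe
# schema may differ. Each step's output feeds the next via templated
# variables.
steps:
  - id: analyze-intent
    output: acceptance_tests
  - id: launch-dtu
    output: app_url
  - id: browser-test
    inputs:
      tests: "{{ analyze-intent.acceptance_tests }}"   # templated variable
      url: "{{ launch-dtu.app_url }}"
    output: results
  - id: generate-report
    inputs:
      results: "{{ browser-test.results }}"
```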
Reads specs, conversation history, and feedback to derive structured acceptance tests in YAML. Classifies software type (web app, CLI, API, library) and assigns must/should/nice priorities. Explicit requirements map to user statements; implicit requirements are inferred from context. Unknowns become documented assumptions.
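A hedged sketch of what one derived acceptance test might look like -- the field names are illustrative, not the bundle's actual schema:

```yaml
# Hypothetical acceptance-test shape -- field names are illustrative.
software_type: web-app          # classified: web app / CLI / API / library
tests:
  - id: send-message
    description: Send a message and get a response
    priority: must              # must / should / nice
    source: explicit            # maps directly to a user statement
  - id: pin-conversation
    description: Pin a conversation in the session sidebar
    priority: should
    source: implicit            # inferred from context
    assumption: Pins persist across reloads   # documented unknown
```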
Drives Chromium via agent-browser CLI using accessibility-tree refs. Navigates, fills inputs, clicks buttons, waits for network idle, and takes screenshots at every checkpoint. Polls for interactive elements instead of fixed sleeps. One explicit PASS/FAIL/ERROR/SKIP per acceptance test -- no summarizing as "likely works."
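What a single validator result might look like -- again a sketch, since the real output schema may differ:

```yaml
# Hypothetical per-test result record -- the real field names may differ.
- test: send-message
  status: FAIL                  # always one of PASS / FAIL / ERROR / SKIP
  steps:
    - fill input @e5 with "hello"
    - click button @e3
    - wait for network idle
  evidence: doSend() guard blocks text-only sends
  screenshot: screenshots/send-message.png
```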
Fuzzy-matches validator results to acceptance tests. Produces report.yaml (structured data) and a self-contained report.html with embedded base64 screenshots, color-coded verdict banner, test-by-test results table, and gap analysis. Verdict logic: any must failure = fail, gaps remaining = partial, all clear = pass.
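The verdict logic is simple enough to show in a report.yaml sketch (illustrative shape, not the exact schema); the counts below match the results table later in this page:

```yaml
# Hypothetical report.yaml shape -- illustrative, not the exact schema.
verdict: fail                   # any must failure = fail; gaps = partial; else pass
summary:
  must: {pass: 4, fail: 1}
  should: {pass: 3, fail: 0}
results:
  - test: send-message
    priority: must
    status: fail
    evidence: doSend() guard blocks text-only sends
gaps: []
```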
Green = Reality Check agents • Blue = Digital Twin Universe • Yellow = Data artifacts • Red = Output artifacts • Dashed = Future/extensible
A verification pipeline needs to be verifiable. The bundle ships with everything needed to test itself.
A 79-line user spec and 25-turn build conversation for Amplifier Chat -- realistic enough to stress-test intent analysis.
Covers: chat interface, session history, pinning, slash commands, health endpoints, streaming, error handling.
Two intentionally subtle bugs that can be injected with a --with-bugs flag:
pin-persistence: wrong JSON key breaks save/load
send-noop: && flipped to || silently blocks sending (sketched below)
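A minimal sketch of the send-noop flip, assuming a doSend() guard along these lines -- the variable names are illustrative, not the app's actual code:

```js
// Hypothetical doSend() guard -- names are illustrative, not the app's code.
function doSend(text, attachments, send) {
  // Original guard bails out only when there is nothing at all to send:
  //   if (!text && !attachments.length) return;
  // The injected bug flips && to ||, so the guard also fires whenever
  // attachments are empty -- every text-only message is silently dropped.
  if (!text || !attachments.length) return;
  send({ text, attachments });
}

// A text-only send now no-ops: the callback never runs.
doSend("hello", [], (msg) => console.log("sent", msg));
```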
One command sets up a self-contained test directory: clones a real app at a pinned commit, copies fixtures, optionally injects bugs. Ready to run the full pipeline against.
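For example, assuming the script is run from the bundle root:

```bash
# One command: clone the app at a pinned commit, copy fixtures,
# and (optionally) inject the two subtle bugs.
./setup-e2e-playground.sh --with-bugs
```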
Documentation for manually testing each pipeline stage in isolation -- intent analysis, DTU launch, browser testing, report generation -- for development and debugging.
The bundle ships with an E2E playground that clones a real app and injects subtle bugs via --with-bugs to validate the full pipeline end to end.
One of the injectable patches flips a guard condition from && to ||, silently blocking all text-only messages. The browser tester catches this because it actually types a message and clicks send.
This is exactly the kind of subtle, silent regression that passes unit tests but fails real-world use.
Run against a version with intentionally injected bugs via setup-e2e-playground.sh --with-bugs to confirm the pipeline catches real failures.
| Test | Priority | Status | Evidence |
|---|---|---|---|
| Chat page loads | must | pass | Title "Amplifier Chat", interactive in 6s |
| Message input and send button visible | must | pass | Input @e5, button @e3 found via snapshot |
| Send a message and get response | must | fail | doSend() guard blocks text-only sends |
| Session history sidebar | must | pass | Sidebar renders with conversation list |
| Pin a conversation | should | pass | Pin icon visible, toggles state |
| Health endpoint returns 200 | must | pass | GET /chat/health returns JSON |
| Streaming response renders progressively | should | pass | SSE chunks arrive, DOM updates live |
| Slash commands (/help, /clear) | should | pass | /help shows command list |
Reality Check depends on DTU but lives in its own bundle. Verification is a distinct capability -- any environment provider could back it.
The initial "contract" abstraction was too unfamiliar. Mapping to acceptance tests -- a known software engineering concept -- made the pipeline legible.
The environment stays up so the user can explore the deployed app themselves -- a working demo, not just a report.
Early runs showed the browser tester skipping criteria. Now every acceptance test must have an explicit PASS, FAIL, ERROR, or SKIP -- no summarizing as "likely works."
Reality Check turns "I think it works" into "here's the evidence."
A verification pipeline that autonomously validates what was built
against what was asked for.
amplifier-bundle-reality-check v0.1.0