
Reality Check

When AI says "done," is it actually done?
A verification pipeline that closes the gap between
"tests pass" and "this actually works."

The Problem

AI-generated software gets verified
in the same context it was built in.

Context Poisoning

Agents claim things work because the build conversation says so. They never test from a fresh perspective.

Deployment Blind Spots

Issues slip through because agents never consider deployment details, dependencies, or real-world setup.

Tests Pass != Works

"Tests pass" and "this actually works for a real user" are two very different statements.

Manual Verification Still Required

The agent says it's done, but verification still falls on you. "Done" becomes "please check this for me."

The Vision

Five requirements for "done means done"

1. Understand user intention
   The machine figures out what to test -- no manually written test plans.
2. Build the environment itself
   No manual configuration. Deploy into an isolated Digital Twin Universe automatically.
3. Run validation autonomously
   Real browser-based testing against the deployed app -- clicks, typing, screenshots.
4. "Done" means done
   You get a working demo and evidence report, not a checklist or a claim.
5. General, not bespoke
   A standard capability any agent can use. Not tied to a specific project or framework.
The Build

From vision to working pipeline

Synthesized the evidence-based testing vision
Scaffolded bundle, agents, and architecture
Built Intent Analyzer and Report agents
Created E2E test harness with injectable bugs
Assembled the 4-step pipeline recipe
Validated pipeline catches injected bugs
Polish, cleanup, and fixture refinement
The Pipeline

Four steps from intent to evidence

1. Intent Analysis
   Reads specs, conversation history, and feedback to produce structured acceptance tests.
2. DTU Launch
   Deploys the software in an isolated Digital Twin Universe environment.
3. Browser Testing
   Drives real Chromium against the running app -- clicks, types, screenshots.
4. Report
   Produces gap analysis YAML and a self-contained HTML artifact with evidence.
Input

User spec, conversation history, the software repo -- the same artifacts that already exist from the build session. No additional setup required.

Output

Structured report.yaml, self-contained report.html with embedded screenshots and verdict, and a running DTU environment you can open in a browser and explore.

Orchestrated as a flat Amplifier recipe -- each step's output flows into the next via templated variables. Runnable as a single command or step-by-step.
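To make the flow concrete, here is a minimal sketch of what such a recipe could look like. The step and field names below are assumptions for illustration, not Amplifier's actual recipe schema.

# Hypothetical recipe sketch -- illustrative names, not the bundle's real schema.
steps:
  - id: intent_analysis
    agent: intent-analyzer
    inputs: {spec: "{{user_spec}}", conversation: "{{conversation_history}}"}
    output: acceptance_tests          # structured YAML acceptance tests
  - id: dtu_launch
    agent: dtu-profile-builder
    inputs: {software: "{{repo_path}}"}
    output: dtu_url                   # running Digital Twin Universe endpoint
  - id: browser_testing
    agent: browser-tester
    inputs: {tests: "{{intent_analysis.output}}", target: "{{dtu_launch.output}}"}
    output: test_results              # one PASS/FAIL/ERROR/SKIP per test
  - id: report
    agent: report
    inputs: {tests: "{{intent_analysis.output}}", results: "{{browser_testing.output}}"}
    output: [report.yaml, report.html]

Each step consumes the previous step's output via templated variables, matching the flat, single-command orchestration described above.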

The Agents

Three specialists, one pipeline

Intent Analyzer

"What does done mean?"

Reads specs, conversation history, and feedback to derive structured acceptance tests in YAML. Classifies software type (web app, CLI, API, library) and assigns must/should/nice priorities. Explicit requirements map to user statements; implicit requirements are inferred from context. Unknowns become documented assumptions.
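Concretely, the acceptance-test YAML might look like the following. This is a hypothetical sketch based on the description above; the field names are illustrative, not the bundle's exact schema.

# Hypothetical acceptance-test output -- illustrative fields only.
software_type: web_app              # web app, CLI, API, or library
acceptance_tests:
  - id: send-message
    priority: must                  # must / should / nice
    source: explicit                # maps to a user statement
    statement: "User can type a message, click send, and see a response"
  - id: pin-conversation
    priority: should
    source: implicit                # inferred from context
    statement: "Pinned conversations persist across page reloads"
assumptions:
  - "No authentication required; the app runs unauthenticated in the DTU"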

Browser Tester

Real browser, real evidence

Drives Chromium via agent-browser CLI using accessibility-tree refs. Navigates, fills inputs, clicks buttons, waits for network idle, and takes screenshots at every checkpoint. Polls for interactive elements instead of fixed sleeps. One explicit PASS/FAIL/ERROR/SKIP per acceptance test -- no summarizing as "likely works."
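Each acceptance test therefore gets one explicit result record. A hypothetical shape, following the illustrative YAML conventions above:

# Hypothetical per-test result -- illustrative fields only.
results:
  - test: send-message
    status: FAIL                    # exactly one of PASS / FAIL / ERROR / SKIP
    evidence: "Typed a message and clicked send; no request fired, no DOM update"
    screenshots: [send_before.png, send_after.png]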

Report

Gap analysis + self-contained HTML artifact

Fuzzy-matches validator results to acceptance tests. Produces report.yaml (structured data) and a self-contained report.html with embedded base64 screenshots, color-coded verdict banner, test-by-test results table, and gap analysis. Verdict logic: any must failure = fail, gaps remaining = partial, all clear = pass.
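As a hedged sketch, report.yaml could encode that verdict logic like this (illustrative structure, not the bundle's exact output):

# Hypothetical report.yaml -- illustrative fields only.
verdict: partial                    # must failure -> fail; open gaps -> partial; all clear -> pass
summary: {passed: 7, failed: 0, skipped: 1, total: 8}
gaps:
  - test: pin-conversation
    priority: should
    detail: "Pin state not re-verified after reload"
screenshots_embedded: 24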

Architecture

Data flow: from user intent to verification report

[Architecture diagram] Spec/requirements, the agent conversation history, and user feedback feed the Intent Analyzer agent, which produces verification criteria. The software under test (repo, directory, or artifact) feeds the DTU Profile Builder agent, which launches a Digital Twin Environment. The Browser Tester agent -- and, in the future, other validators (CLI, API, ...) -- test against that environment using the criteria and pass results to the Report agent, which produces the Reality Check Report and a user-facing verification artifact (visual, dashboard, ...).

Green = Reality Check agents  •  Blue = Digital Twin Universe  •  Yellow = Data artifacts  •  Red = Output artifacts  •  Dashed = Future/extensible

Testing Infrastructure

Built to prove itself

A verification pipeline needs to be verifiable. The bundle ships with everything needed to test itself.

Synthetic Test Fixtures

A 79-line user spec and 25-turn build conversation for Amplifier Chat -- realistic enough to stress-test intent analysis.

Covers: chat interface, session history, pinning, slash commands, health endpoints, streaming, error handling.

Injectable Bug Patches

Two intentionally subtle bugs that can be injected with a --with-bugs flag:

pin-persistence: wrong JSON key breaks save/load
send-noop: && to || silently blocks sending

E2E Playground Script

One command sets up a self-contained test directory: clones a real app at a pinned commit, copies fixtures, optionally injects bugs. Ready to run the full pipeline against.

Step-by-Step Playground Guide

Documentation for manually testing each pipeline stage in isolation -- intent analysis, DTU launch, browser testing, report generation -- for development and debugging.

Validation

The pipeline catches intentionally injected bugs

The bundle ships with an E2E playground that clones a real app and injects subtle bugs via --with-bugs to validate the full pipeline end to end.

Example: send-noop bug injection

One of the injectable patches flips a guard condition from && to ||, silently blocking all text-only messages. The browser tester catches this because it actually types a message and clicks send.

// Guard: don't send if no content AND no images
- if (!content && pendingImages.length === 0) return;
+ if (!content || pendingImages.length === 0) return;

This is exactly the kind of subtle, silent regression that passes unit tests but fails real-world use.

FAIL (pipeline detects the bug) → fix applied, DTU restarted → pipeline re-run → PASS (all tests pass)
Results

E2E validation run against Amplifier Chat

Run against a version with intentionally injected bugs via setup-e2e-playground.sh --with-bugs to validate the pipeline catches real failures.

Test | Priority | Status | Evidence
Chat page loads | must | pass | Title "Amplifier Chat", interactive in 6s
Message input and send button visible | must | pass | Input @e5, button @e3 found via snapshot
Send a message and get response | must | fail | doSend() guard blocks text-only sends
Session history sidebar | must | pass | Sidebar renders with conversation list
Pin a conversation | should | pass | Pin icon visible, toggles state
Health endpoint returns 200 | must | pass | GET /chat/health returns JSON
Streaming response renders progressively | should | pass | SSE chunks arrive, DOM updates live
Slash commands (/help, /clear) | should | pass | /help shows command list
7/8 tests passed (after fix: 8/8)  •  24 screenshots captured  •  Verdict (before fix): PARTIAL
Key Decisions

Design choices that shaped the bundle

Separate bundle, not a DTU feature (separation of concerns)

Reality Check depends on DTU but lives in its own bundle. Verification is a distinct capability -- any environment provider could back it.

"Acceptance tests" not "contracts" SIMPLICITY

The initial "contract" abstraction was too unfamiliar. Mapping to acceptance tests -- a known software engineering concept -- made the pipeline legible.

DTU left running after the pipeline (user experience)

The environment stays up so the user can explore the deployed app themselves -- a working demo, not just a report.

One result row per acceptance test (exhaustiveness)

Early runs showed the browser tester skipping criteria. Now every acceptance test must have an explicit PASS, FAIL, ERROR, or SKIP -- no summarizing as "likely works."

"Done" should mean done.

Reality Check turns "I think it works" into "here's the evidence."

A verification pipeline that autonomously validates what was built
against what was asked for.

Intent: What does done mean?
Environment: Deploy it for real
Evidence: Prove it works
amplifier-bundle-reality-check v0.1.0