amplifier-bundle-terminal-tester can launch TUI apps, send keystrokes, and capture screen state. But it didn't know how to reach into a Digital Twin Universe environment, and it needed improvements along the way.
Terminal-tester already had PTY emulation and screen-dump modes. It could spawn local apps, send keys, and take screenshots.
What it lacked was DTU connectivity: running TUI apps inside a Digital Twin container and testing them from the outside. It also needed general reliability improvements for real-world TUI interaction.
Installed the OpenAI Codex CLI into a DTU and wrote acceptance tests to exercise it.
The tests cover launch, message/response, /status, the /statusline picker, and session resume with memory verification.
Each test describes actions (what to do) and expectations (what the screen should show). The terminal-tester agent interprets and executes them.
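As an illustration of the actions/expectations split described above, here is a minimal sketch of how one such test could be represented in code. The AcceptanceTest class, its field names, and the example steps are assumptions made for this sketch; the source does not show the actual test file format.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceTest:
    """One acceptance test: what to do, and what the screen should show."""
    name: str
    # Steps for the terminal-tester agent to perform, in order.
    actions: list[str] = field(default_factory=list)
    # Statements about what the screen should show afterwards.
    expectations: list[str] = field(default_factory=list)


# Hypothetical example, loosely modeled on the message/response test.
message_response = AcceptanceTest(
    name="message-response",
    actions=["Launch codex", "Type 'What is 2+2?'", "Press {ENTER}"],
    expectations=["The response contains '4'"],
)

print(
    f"{message_response.name}: {len(message_response.actions)} actions, "
    f"{len(message_response.expectations)} expectations"
)
```

Keeping actions and expectations as separate lists mirrors the description above: the agent executes the first list and is judged against the second.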
The first run passed all 7 tests, but the journey was painful.
The session was spawned with exec <id> bash (an extra argument). Six iterations of debugging quoting followed: which codex && codex --version failed, bash -c failed, and flags were consumed by DTU exec itself.
After sending "What is 2+2?", the screen appeared frozen through 4 screenshot/sleep cycles (~110s of waiting). Codex had actually responded but needed {ENTER} to confirm.
The /statusline picker required arrow-key navigation. It took multiple {ENTER} → screenshot → {ENTER} → screenshot cycles (~6 extra iterations) to discover that {DOWN} + space was needed.
The agent ran codex instead of codex resume, creating a new empty session. It had to quit, relaunch, and navigate a 3-session picker: 16 iterations, ~4 minutes.
Fixed amplifier-digital-twin exec argument parsing. No more bash wrapper or flag collisions.
Key tokens such as {ENTER}, {DOWN}, and {SPACE} are sent into the PTY session for reliable TUI interaction.
wait_for_text instead of sleep-and-screenshot loops.
Replaced the validation-dispatcher agent (which lacked the delegate tool) with direct validator steps in the pipeline. This eliminated the delegation failure entirely.
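The key-token and wait_for_text ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the terminal-tester implementation: the expand_keys helper, the KEY_TOKENS table, and the signature of wait_for_text (a screen-reading callback plus a timeout) are all hypothetical. The escape sequences are the standard ones a terminal sends in normal cursor mode.

```python
import re
import time

# Characters a terminal normally sends for these keys (normal cursor mode).
KEY_TOKENS = {
    "{ENTER}": "\r",
    "{DOWN}": "\x1b[B",  # ANSI cursor-down sequence
    "{SPACE}": " ",
}

_TOKEN_RE = re.compile(r"\{[A-Z]+\}")


def expand_keys(text):
    """Replace known {TOKEN} markers with the raw characters to write to the PTY.

    Unknown tokens are left untouched so they remain visible in the output.
    """
    return _TOKEN_RE.sub(lambda m: KEY_TOKENS.get(m.group(0), m.group(0)), text)


def wait_for_text(read_screen, needle, timeout=30.0, interval=0.2):
    """Poll read_screen() until needle appears.

    Returns True as soon as the text is seen, or False once the timeout expires.
    read_screen is any zero-argument callable returning the current screen text.
    """
    deadline = time.monotonic() + timeout
    while True:
        if needle in read_screen():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Simulated screen that changes on each read (stands in for a real PTY).
    frames = iter(["", "thinking...", "2+2 = 4"])
    last = [""]

    def fake_screen():
        last[0] = next(frames, last[0])
        return last[0]

    print(wait_for_text(fake_screen, "4", timeout=2.0, interval=0.01))
    print(repr(expand_keys("{DOWN}{SPACE}{ENTER}")))
```

Polling for a specific string returns as soon as the app responds, whereas a fixed sleep either waits too long or captures the screen too early.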
Both modes the agent has today are imperfect. Text loses detail; the current screenshots don't render exactly like the real terminal.
Text mode: a text rendering of the terminal screen via pyte VT100 emulation. It misses animations and color.
Screenshot mode: PNG captures via pyte rendering. They are closer to what a human sees but still don't render exactly like the real terminal.
Codex message/response captured as PNG inside the DTU.
Follow-up: Better visual fidelity overall — neither mode is enough today. Also: better routing between browser-tester, terminal-tester, and generic validators.
Directory-based acceptance tests that scale.
Real-world use cases produce many acceptance tests, often organized by feature rather than kept in a single flat file. Reality Check now handles both shapes.
Previously, Reality Check expected a single acceptance-tests file. That was fine for hand-written test specs but did not scale when tests were grouped into feature folders.
The pipeline now accepts a directory of acceptance tests. It iterates through each file, running the full validation flow per feature.
Why it matters: Reality Check now works for both small hand-crafted test suites and large auto-generated ones.
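The file-or-directory handling described above can be sketched as follows. This is a minimal illustration, not the Reality Check pipeline: the function names, the validate callback, and the assumption that test files end in .md are all hypothetical.

```python
from pathlib import Path


def collect_test_files(target):
    """Return acceptance-test files for either a single file or a directory."""
    path = Path(target)
    if path.is_file():
        return [path]
    if path.is_dir():
        # Sorted so runs are deterministic; rglob also picks up feature subfolders.
        # The ".md" suffix is an assumption made for this sketch.
        return sorted(p for p in path.rglob("*.md") if p.is_file())
    raise FileNotFoundError(f"no acceptance tests at {path}")


def run_all(target, validate):
    """Run the validation callback once per file and collect results by path."""
    return {str(f): validate(f) for f in collect_test_files(target)}
```

Accepting both shapes in one entry point means existing single-file suites keep working while feature folders are validated one file at a time.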
An independent, deployment-level validator that complements unit tests, TDD loops, and pre-commit gates with a non-LLM acceptance-test oracle.
Runs acceptance criteria directly against the deployed system as a non-LLM oracle. Pass/fail is decided by observable behavior, not agent judgment.
Covers TUIs via terminal-tester, web UIs via browser-tester, and CLIs directly.
Consumes a directory of acceptance tests — one file per feature — so feature-spec ACs drive validation directly, without needing a separate QA-area definition step.
And it all runs inside a Digital Twin — Incus-based isolation with DNS rewriting, port forwarding, and external service passthrough. Reality Check validates software in an environment closer to real deployment.
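To make the non-LLM oracle idea concrete, here is a minimal sketch in which pass/fail comes only from comparing observed output against expected strings. The evaluate function and its result shape are assumptions for illustration; the source does not describe how Reality Check implements its checks.

```python
def evaluate(expected_substrings, observed_text):
    """Deterministic check: every expected substring must appear in the output.

    No model judgment is involved; the same inputs always give the same verdict.
    """
    missing = [s for s in expected_substrings if s not in observed_text]
    return {"passed": not missing, "missing": missing}


if __name__ == "__main__":
    print(evaluate(["4"], "2+2 = 4"))
    print(evaluate(["4", "status"], "2+2 = 4"))
```

Reporting the missing expectations alongside the verdict makes a failure traceable to a specific observable behavior.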
Purpose-built testing for the Amplifier ecosystem.
Made to simplify using Gitea + Digital Twin Universe for Amplifier testing scenarios. Future replacement for the shadow-env bundle. See amplifier-bundle-amplifier-tester.
Gathers context, classifies repo changes, mirrors to Gitea, dynamically generates DTU profiles, and launches and verifies environments.
Runs targeted validation checks inside DTUs with tested exec patterns.
Open question: as the future replacement, which install requirements must be in place before it can fully replace shadow-env?
File ops, streaming, mDNS, and a broader vision.
Push files into the environment and pull files out. There is also a provision.files profile key for provisioning files at launch time.
Real-time stdout/stderr passthrough. No more waiting for the full command to finish before seeing output.
Example profile demonstrating how to use DTU with private GitHub repositories.
localhost becomes myapp.local. Human-friendly hostnames for environments.
Mirroring large repos went from minutes to seconds thanks to better defaults.
Digital Twin Universe has evolved from a tool for testing and validating software to also supporting longer-lived agent dev environments and larger-scale evaluation.
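The real-time stdout/stderr passthrough listed above can be illustrated with a generic sketch that forwards each output line as soon as the child process produces it. This uses only the Python standard library and is not the DTU implementation; the stream_command name and its on_line callback are assumptions.

```python
import subprocess
import sys


def stream_command(argv, on_line=print):
    """Run argv and hand each output line to on_line as soon as it is produced."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on the reading side
    )
    with proc.stdout:
        for line in proc.stdout:
            on_line(line.rstrip("\n"))
    return proc.wait()


if __name__ == "__main__":
    # "-u" keeps the child's own output unbuffered so lines arrive promptly.
    code = stream_command(
        [sys.executable, "-u", "-c", "print('step 1'); print('step 2')"]
    )
    print("exit code:", code)
```

Reading line by line instead of waiting for the process to exit is what lets a long-running command show progress while it is still working.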
Reality Check validates TUIs end-to-end. Amplifier Tester is positioned to replace shadow-env using Gitea + DTU. Digital Twin Universe now supports file ops, streaming, mDNS, and private repos.