Digital Twin Universe Demos

TUI testing, amplifier-tester, and more

Date April 16, 2026 Author David Koleczek
Part 1

Reality Check for TUIs

Reality Check · The Problem

We have a terminal tester. It needs to work with Digital Twins.

amplifier-bundle-terminal-tester can launch TUI apps, send keystrokes, and capture screen state. But it couldn't reach into a Digital Twin Universe environment, and it needed reliability improvements along the way.

What existed

Terminal-tester with PTY emulation and screen-dump modes. Could spawn local apps, send keys, take screenshots.

What was missing

DTU connectivity — running TUI apps inside a Digital Twin container and testing them from the outside. Plus general reliability improvements for real-world TUI interaction.

Reality Check · Test Case

Codex CLI as the first TUI in a Digital Twin.

Installed the OpenAI Codex CLI into a DTU and wrote acceptance tests to exercise it.

tests:
  - description: "Sending a message produces an LLM response"
    type: cli
    priority: must
    steps:
      - action: "Type 'What is 2+2?' and press Enter"
        expect: "The model replies with a coherent answer"
  - description: "Session resume with memory"
    type: cli
    priority: must
    steps:
      - action: "Type 'remember the word pineapple'"
      - action: "Exit, run 'codex resume', select session"
      - action: "Ask 'What word did I ask you to remember?'"
        expect: "The model replies with 'pineapple'"

7 acceptance tests

Covering launch, message/response, /status, /statusline picker, and session resume with memory verification.

Each test describes actions (what to do) and expectations (what the screen should show). The terminal-tester agent interprets and executes them.
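A minimal sketch of how a runner might walk such a spec. The dict shape mirrors the YAML above, but the `do_action` and `check_expect` callbacks are illustrative assumptions, not the terminal-tester's actual API:

```python
# Hypothetical in-memory form of the acceptance-test spec above.
# The real pipeline reads YAML; plain dicts keep this sketch self-contained.
TESTS = [
    {
        "description": "Sending a message produces an LLM response",
        "type": "cli",
        "priority": "must",
        "steps": [
            {"action": "Type 'What is 2+2?' and press Enter",
             "expect": "The model replies with a coherent answer"},
        ],
    },
]

def run_tests(tests, do_action, check_expect):
    """Walk each test's steps: perform the action, then verify the
    expectation when the step has one. Returns pass/fail per test."""
    results = {}
    for test in tests:
        ok = True
        for step in test["steps"]:
            do_action(step["action"])
            if "expect" in step:
                ok = ok and check_expect(step["expect"])
        results[test["description"]] = ok
    return results
```

In the real pipeline, the action callback would send keystrokes into the PTY and the expectation callback would inspect captured screen state.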

Codex CLI running inside the DTU
Reality Check · The Churn

87 iterations. 14 minutes. 51 terminal calls.

The first run passed all 7 tests, but the journey was painful.

Wrong exec command

Spawned with exec <id> bash (an extra argument). Then six iterations debugging quoting: which codex && codex --version failed, bash -c failed, and flags were consumed by the DTU exec command itself.

Frozen screen misread

After "What is 2+2?" the screen appeared frozen. 4 screenshot/sleep cycles (~110s of waiting). Codex had actually responded but needed {ENTER} to confirm.

TUI navigation by trial-and-error

The /statusline picker required arrow-key navigation. Multiple {ENTER} → screenshot → {ENTER} → screenshot cycles (~6 extra iterations) before discovering {DOWN} + space.

Session resume fail

Ran codex instead of codex resume, creating a new empty session. Had to quit, relaunch, navigate a 3-session picker. 16 iterations, ~4 minutes.

Reality Check · The Fixes

From churn to first-try passes.

1 DTU exec semantics — Updated terminal-tester agent instructions with correct amplifier-digital-twin exec argument parsing. No more bash wrapper or flag collisions.
2 Keystroke handling — Fixed how the agent sends special keys like {ENTER}, {DOWN}, {SPACE} into the PTY session. Reliable TUI interaction.
3 Better instruction-following — Improved the terminal-tester agent's guidance for reading acceptance test steps literally and using wait_for_text instead of sleep-and-screenshot loops.
4 Architecture pivot — Replaced the validation-dispatcher agent (which lacked the delegate tool) with direct validator steps in the pipeline. Eliminated the delegation failure entirely.
Result: most Codex acceptance tests now pass on the first try.
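The two mechanical fixes above (special-key handling and wait_for_text) can be sketched as follows. The {TOKEN} names come from the tests; the byte sequences are ordinary VT100, and the exact sequences and function signatures terminal-tester uses are assumptions:

```python
import time

# Standard terminal sequences for the special-key tokens the agent sends.
# The token names appear in the acceptance tests; the mapping here is
# plain VT100 and may differ from terminal-tester's internals.
SPECIAL_KEYS = {
    "{ENTER}": "\r",
    "{DOWN}": "\x1b[B",
    "{UP}": "\x1b[A",
    "{SPACE}": " ",
}

def encode_keys(text: str) -> str:
    """Replace {TOKEN} markers with their terminal byte sequences."""
    for token, seq in SPECIAL_KEYS.items():
        text = text.replace(token, seq)
    return text

def wait_for_text(read_screen, needle: str, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Poll the screen until `needle` appears, instead of running fixed
    sleep-and-screenshot cycles. `read_screen` returns the current
    screen rendered as text."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if needle in read_screen():
            return True
        time.sleep(interval)
    return False
```

Polling on the expected text is what turns the "frozen screen" case above from ~110s of blind waiting into a bounded check that returns as soon as the reply renders.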
Reality Check · What the Agent Sees

Better visual fidelity is next.

Both modes the agent has today are imperfect. Text loses detail; the current screenshots don't render exactly like the real terminal.

Text mode

A text rendering of the terminal screen via pyte VT100 emulation — misses animations and color.

╭───────────────────────────────────────╮
│ >_ OpenAI Codex (v0.120.0)            │
│                                       │
│ model: gpt-5.4 /model to change       │
│ directory: ~                          │
╰───────────────────────────────────────╯

› What is 2+2?

• 4

› Implement {feature}

gpt-5.4 default · ~
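The actual emulation uses pyte, which interprets the full VT100 protocol. As a rough stdlib-only illustration of the idea (turn a raw terminal byte stream into plain screen text), here is a minimal sketch that merely strips ANSI escape sequences; it ignores cursor movement and color, which a real emulator handles:

```python
import re

# CSI sequences (colors, cursor moves), OSC sequences, and other escapes.
# A real VT100 emulator like pyte interprets these; this sketch just
# removes them to approximate a plain-text screen dump.
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]|\x1b\][^\x07]*\x07|\x1b.")

def screen_text(raw: str, columns: int = 80) -> list[str]:
    """Very rough text rendering of terminal output: drop escape
    sequences, split on newlines, clip each line to the screen width."""
    plain = ANSI_RE.sub("", raw)
    return [line[:columns] for line in plain.replace("\r", "").split("\n")]
```

This also makes the fidelity gap concrete: everything the regex discards (color, animation, cursor state) is exactly the detail text mode loses.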

Screenshot mode

PNG captures via pyte rendering — closer to what a human sees, but still doesn't render exactly like the real terminal.

Codex message/response screenshot

Codex message/response captured as PNG inside the DTU.

Follow-up: Better visual fidelity overall — neither mode is enough today. Also: better routing between browser-tester, terminal-tester, and generic validators.

Part 2

Reality Check at Scale

Reality Check · Scaling Up

Many acceptance tests, organized by feature.

Real-world use cases produce many acceptance tests, often organized by feature rather than kept in a single flat file. Reality Check now handles both shapes.

Before

Reality Check expected a single acceptance-tests file. Fine for hand-written test specs, but didn't scale when tests were grouped into feature folders.

After

The pipeline now accepts a directory of acceptance tests. It iterates through each file, running the full validation flow per feature.

Why it matters: Reality Check now works for both small hand-crafted test suites and large auto-generated ones.
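The directory-aware flow is simple to picture. A sketch, where the one-file-per-feature layout and the per-file runner callback are illustrative assumptions:

```python
from pathlib import Path

def run_acceptance_dir(tests_path: str, run_file) -> dict[str, bool]:
    """Run the full validation flow once per feature file.
    Accepts either a single spec file or a directory of them,
    so both hand-written and auto-generated suites work."""
    root = Path(tests_path)
    files = [root] if root.is_file() else sorted(root.glob("*.yaml"))
    return {f.stem: run_file(f) for f in files}
```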

Reality Check — the reliable and robust QA tester.

An independent, deployment-level validator that complements unit tests, TDD loops, and pre-commit gates with a non-LLM acceptance-test oracle.

Ground-truth oracle

Runs acceptance criteria directly against the deployed system as a non-LLM oracle. Pass/fail is decided by observable behavior, not agent judgment.

TUI, CLI, and web

Covers TUIs via terminal-tester, web UIs via browser-tester, and CLIs directly.

Specs become runnable

Consumes a directory of acceptance tests — one file per feature — so feature-spec ACs drive validation directly, without needing a separate QA-area definition step.

And it all runs inside a Digital Twin — Incus-based isolation with DNS rewriting, port forwarding, and external service passthrough. Reality Check validates software in an environment closer to real deployment.

Part 3

Amplifier Tester

Amplifier Tester

Purpose-built testing for the Amplifier ecosystem.

Built to simplify using Gitea + Digital Twin Universe in Amplifier testing scenarios; a future replacement for the shadow-env bundle. See amplifier-bundle-amplifier-tester.

setup-digital-twin agent

Gathers context, classifies repo changes, mirrors to Gitea, dynamically generates DTU profiles, launches and verifies environments.

ecosystem-validator agent

Runs targeted validation checks inside DTUs with tested exec patterns.

Amplifier Tester architecture diagram

Open question: install requirements must be handled before Amplifier Tester can fully replace shadow-env.

Part 4

Digital Twin Improvements

Digital Twin Universe · New Features

File operations and streaming exec.

file-push / file-pull

Push files into the environment and pull files out. Plus provision.files profile key for initial file provisioning at launch time.
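A hypothetical profile fragment: the provision.files key comes from the release notes, but the field names inside it (source, target) are assumptions about the schema, not documented syntax:

```yaml
# Sketch of launch-time file provisioning (field names are illustrative)
provision:
  files:
    - source: ./fixtures/app-config.toml   # local file to push
      target: /home/dev/.config/app.toml   # path inside the environment
```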

--stream flag on exec

Real-time stdout/stderr passthrough. No more waiting for the full command to finish before seeing output.
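The generic pattern behind a --stream flag is to relay child output line by line as it arrives instead of buffering until exit. A Python sketch of the idea (not DTU's actual implementation):

```python
import subprocess
import sys

def exec_stream(cmd: list[str], out=sys.stdout) -> int:
    """Run a command and pass its output through in real time,
    returning the exit code. Without streaming, callers only see
    output after the process finishes."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    for line in proc.stdout:       # yields lines as the child emits them
        out.write(line)
        out.flush()
    return proc.wait()
```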

Private repo support

Example profile demonstrating how to use DTU with private GitHub repositories.

Digital Twin Universe · More Improvements

mDNS, Gitea performance, and a broader vision.

mDNS support

localhost becomes myapp.local. Human-friendly hostnames for environments.

Gitea performance

Mirroring large repos went from minutes to seconds with better defaults.

Broader scope

Evolved from a tool for testing/validating software to also supporting longer-living agent dev environments and larger-scale evaluation.

From "tests pass" to
"this actually works."

Reality Check validates TUIs end-to-end. Amplifier Tester replaces shadow-env with Gitea + DTU. Digital Twin Universe now supports file ops, streaming, mDNS, and private repos.
