More Amplifier Stories

From Zero to Real-Time AI
in Two Weeks

A WebRTC voice assistant built on Amplifier.
Three architectures. 30 commits. One developer.

Brian Krabach · January 30 – February 18, 2026
Active

“Can Amplifier
do voice?”

Voice is the interface everyone asks about first.
Not chat. Not IDE. Voice.

The challenge: voice AI isn't a feature you bolt on. It needs real-time streaming, sub-second latency, and an architecture that lets an AI think while it talks.

Yes. And it's novel.

Not a wrapper around a speech API. A voice-first AI that delegates to a team of specialists.

🎤
You speak
🌐
WebRTC
🧠
GPT Realtime
Amplifier
🤖
Agent Team
Explore code
"Look at the auth module and tell me what you find"
Write code
"Add a caching layer to the API endpoint"
Debug
"Why is the test for session forking failing?"
Research
"Find the latest docs on WebRTC data channels"
Architect
"Design a retry strategy for flaky connections"
Ship
"Commit this and create a PR"

Three Architectures in Four Days

Each pivot was a learning moment. Each one made the system fundamentally better.

Day 1
Pivot 1 Many Tools → Orchestration Only
Started with flight tracking, weather, and multiple other tools exposed directly to the voice model. Realized: a voice model with limited context shouldn't run tools directly. Stripped everything, leaving only task.
Day 3
Pivot 2 Task Tool → Delegate Tool
task was fire-and-forget. Switched to delegate for context control, session resumption, and provider selection. Multi-turn agent conversations became possible.
Day 3
Pivot 3 Auto-Respond → Manual Response Control
Default behavior: model speaks immediately after detecting silence. New behavior: the model decides WHEN to speak, not just what to say. First attempt reverted. Second attempt succeeded.
1

The Voice Model Is an Orchestrator,
Not an Executor

Before
// Voice model had direct access to everything
tools: [
  "flight_tracker", "weather", "calendar",
  "todo", "task", "web_search", ...
]

Too many tools. Limited context window. Model confused about when to use what.

After
// Voice model has ONE job: orchestrate
REALTIME_TOOLS = {"task"}

// All real work delegated to specialists
// explorer, architect, builder, bug-hunter...

One tool. Clear purpose. Delegates everything to the right specialist agent.

Key insight: A real-time voice model has a tiny context window. Don't make it think about tools. Make it think about who to ask.
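The "who to ask" idea can be sketched in a few lines. This is an illustrative routing heuristic, not the project's real dispatcher: the agent names come from the story, but `pickAgent` and its keyword rules are assumptions for the sake of the example.

```typescript
// Hypothetical sketch: the realtime model exposes a single tool, and the
// bridge routes each request to a specialist agent.
const REALTIME_TOOLS = new Set(["task"]); // the voice model's entire toolbox

type Agent =
  | "foundation:explorer"
  | "foundation:bug-hunter"
  | "foundation:zen-architect"
  | "foundation:git-ops";

// Illustrative keyword routing — the real system lets the model choose.
function pickAgent(request: string): Agent {
  const r = request.toLowerCase();
  if (/why|fail|bug/.test(r)) return "foundation:bug-hunter";
  if (/design|strategy/.test(r)) return "foundation:zen-architect";
  if (/commit|pr|ship/.test(r)) return "foundation:git-ops";
  return "foundation:explorer"; // default: go look at the code
}
```

The point of the shape, not the regexes: the voice model's job collapses to one decision per request — which specialist gets it.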
2

From Fire-and-Forget
to Conversation

task created one-shot agents. delegate enables multi-turn dialogues with persistent state.

Context Control
// Agent gets exactly the right context
delegate({
  agent: "foundation:explorer",
  context_depth: "recent",  // none | recent | all
  context_turns: 5
})
Session Resumption
// Resume a prior agent conversation
delegate({
  agent: "foundation:modular-builder",
  session_id: "abc-123..."  // pick up where we left off
})
Provider Selection
// Voice = GPT Realtime, Agents = Claude
provider_preferences: [
  { provider: "anthropic", model: "claude-sonnet-*" }
]
Specialist Routing
// Right agent for the right job
"foundation:explorer"       // scan code
"foundation:bug-hunter"     // debug
"foundation:zen-architect"  // design
"foundation:git-ops"        // ship
3

The Model Decides
When to Speak

Default voice AI: detect silence → auto-respond.
Amplifier Voice: detect silence → model chooses its moment.

// Semantic VAD + manual response control
session.update({
  audio: {
    input: {
      turn_detection: {
        type: "semantic_vad",
        eagerness: "low",
        create_response: false,  // KEY
        interrupt_response: true
      }
    }
  }
})

// After transcription, client triggers:
dataChannel.send(
  JSON.stringify({ type: "response.create" })
)
First attempt: reverted
"True silence mode" was too aggressive. Broke natural conversation flow. Commit fa859f0 → Revert 78441f5
Second attempt: shipped
Separate "detecting end of speech" from "auto-generating response." Model uses instructions to decide engagement level. Commit a2f6042
Why it matters
The model can listen to a multi-sentence request without interrupting after the first pause. Natural, human-like conversation.

Results Arrive While
the Model Is Speaking

You ask three things. The model starts answering the first. Meanwhile, agents finish the other two. What happens?

// Track response state
const responseInProgress = useRef(false);
const pendingAnnouncements = useRef([]);

// Tool result arrives while speaking?
if (responseInProgress.current) {
  // Queue it. Don't interrupt.
  pendingAnnouncements.current.push({ toolName, callId });
} else {
  // Not speaking? Report immediately.
  triggerResponse();
}

// When model finishes speaking:
case "response.done":
  responseInProgress.current = false;
  setTimeout(() => {
    if (pendingAnnouncements.current.length > 0) {
      flushPendingAnnouncements();
    }
  }, 100);
The Flush Message
"The explorer and architect tasks completed while you were speaking. Please report those results now briefly."
No Interruptions
Model finishes its current thought before reporting late-arriving results.
No Lost Results
Every tool result is queued and announced. Nothing falls through.
Natural Flow
Feels like a coworker saying "Oh, and I also found..." after finishing a thought.

Concurrent Tool Calls
That Don't Collide

"Explore the auth module AND check the test coverage" — two agents, running simultaneously, each with a unique tracking ID.

// Each tool call gets a unique ID from OpenAI
const statusMessage = {
  sender: "system",
  text: `Delegating to ${getFriendlyToolName(toolCall.name)}...`,
  toolCallId: toolCall.id,  // <-- unique per concurrent call
  toolStatus: "executing"
};

// Update ONLY this specific call's status on completion
setMessages(prev => prev.map(msg =>
  msg.toolCallId === toolCall.id && msg.toolStatus === "executing"
    ? { ...msg, text: `Completed ${name}`, toolStatus: "completed" }
    : msg
));
Parallel Execution
Multiple agents run at the same time. No serialization bottleneck.
Isolated Tracking
toolCallId maps each result to its originating request. No cross-talk.
Live UI Updates
Each task shows status independently: delegating → executing → completed.

22 Event Types, Streamed
Live to Your Browser

Every Amplifier event — from LLM requests to agent forks — appears in real-time via Server-Sent Events.

🔼 provider:request
🔽 provider:response
🔧 tool:pre
🔧 tool:post
🔧 tool:error
🔀 session:fork
🔀 session:join
🧠 thinking:delta
🧠 thinking:final
content_block:start
content_block:delta
🔔 user:notification
context:compaction
✅ approval:request
Server-Side Hook
# Captures ALL Amplifier events
EVENTS_TO_CAPTURE = [
    "content_block:start", "content_block:delta",
    "thinking:delta",
    "tool:pre", "tool:post",
    "session:fork", "session:join",
    "provider:request", "llm:request:raw",
    ...  # 22 event types total
]
Why This Matters
Voice AI is a black box. You say something, something happens, you get audio back. SSE streaming makes the invisible visible.
See Claude thinking. See agents spawning. See tool calls executing. See token costs accumulating. All in real-time, in a browser console with color-coded icons.
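Consuming that stream in the browser takes very little code. A minimal sketch, assuming a hypothetical `/events` SSE endpoint and an illustrative icon map (the real client's endpoint and mapping may differ):

```typescript
// Illustrative icon mapping for color-coded console output.
const EVENT_ICONS: Record<string, string> = {
  "provider:request": "🔼",
  "provider:response": "🔽",
  "tool:pre": "🔧",
  "tool:post": "🔧",
  "session:fork": "🔀",
  "thinking:delta": "🧠",
};

// Turn one streamed event into a console line.
function formatEvent(type: string, payload: unknown): string {
  const icon = EVENT_ICONS[type] ?? "•"; // fallback for unmapped types
  return `${icon} ${type}: ${JSON.stringify(payload)}`;
}

// In the browser (hypothetical endpoint):
// const source = new EventSource("/events");
// source.onmessage = (e) => {
//   const { type, payload } = JSON.parse(e.data);
//   console.log(formatEvent(type, payload));
// };
```

That's the whole trick: SSE is just a long-lived HTTP response, so the black box becomes a scrolling log.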

Connection Health Monitoring
and Smart Reconnection

WebRTC sessions have a 60-minute hard limit. Connections drop. Networks flake. The system handles all of it.

Health States
Healthy
Warning
Critical
Disconnected
Disconnect Reasons Tracked
idle_timeout
session_limit
connection_failed
data_channel_closed
stale_connection
network_error
Thresholds
idleWarning:    2 min
sessionWarning: 55 min  // 5 min before limit
sessionLimit:   60 min  // OpenAI hard cap
staleThreshold: 30 sec
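Deriving a health state from those thresholds is a pure function. A sketch under stated assumptions — `healthState` and its parameters are hypothetical names, not the project's hook API:

```typescript
type Health = "healthy" | "warning" | "critical" | "disconnected";

// Thresholds from the story; field names are illustrative.
const THRESHOLDS = {
  sessionWarningMin: 55, // 5 min before the limit
  sessionLimitMin: 60,   // OpenAI hard cap
  staleThresholdSec: 30, // no events in this window => stale
};

function healthState(
  connected: boolean,
  sessionAgeMin: number,
  lastEventAgeSec: number
): Health {
  if (!connected) return "disconnected";
  if (sessionAgeMin >= THRESHOLDS.sessionLimitMin) return "critical";
  if (sessionAgeMin >= THRESHOLDS.sessionWarningMin ||
      lastEventAgeSec >= THRESHOLDS.staleThresholdSec) return "warning";
  return "healthy";
}
```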
4 Reconnection Strategies
manual — User clicks to reconnect
auto_immediate — Instant retry
auto_delayed — Backoff then retry
proactive — Reconnect before expiry
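One plausible way to wire reasons to strategies — the mapping below is an assumption for illustration, not the project's actual policy:

```typescript
type DisconnectReason =
  | "idle_timeout" | "session_limit" | "connection_failed"
  | "data_channel_closed" | "stale_connection" | "network_error";

type Strategy = "manual" | "auto_immediate" | "auto_delayed" | "proactive";

// Hypothetical policy: reconnect proactively near the 60-min cap,
// back off on flaky networks, wait for the user after idle timeouts.
function chooseStrategy(reason: DisconnectReason, sessionAgeMin: number): Strategy {
  if (sessionAgeMin >= 55) return "proactive"; // beat the hard limit
  switch (reason) {
    case "idle_timeout":        return "manual";         // user walked away
    case "session_limit":
    case "stale_connection":
    case "data_channel_closed": return "auto_immediate"; // clean break, retry now
    case "connection_failed":
    case "network_error":       return "auto_delayed";   // backoff then retry
    default:                    return "manual";
  }
}
```

The useful property: every disconnect reason maps to exactly one strategy, so reconnection behavior stays predictable under test.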

Two Processes. Zero Wrappers.
Direct API Calls.

voice-server · Python
FastAPI + Uvicorn — HTTP/SSE server
httpx — Async OpenAI Realtime API calls
amplifier-foundation — Agent framework
sse-starlette — Event streaming
Amplifier bridge executes tools via direct Python calls. Zero subprocess overhead.
voice-client · TypeScript
React 18 + Vite — UI framework
Fluent UI + Copilot Components — Microsoft design system
Zustand — Lightweight state management
WebRTC — Real-time audio streaming
5 custom hooks: useWebRTC, useVoiceChat, useChatMessages, useAmplifierEvents, useConnectionHealth
Multi-Model Architecture
Voice layer: OpenAI gpt-realtime (GA) — speech-to-speech, real-time audio
Agent layer: Anthropic Claude Sonnet — deep reasoning, code generation, tool use
Each model does what it's best at. Voice handles conversation. Claude handles thinking.

By the Numbers

30
commits (Jan 30 – Feb 18)
4
days of development
1
developer
3
architecture pivots
100+
research docs
5
custom React hooks
22
event types streamed
Why this was possible: Amplifier's modular architecture meant each pivot was a configuration change, not a rewrite. The voice model didn't need to change — only its relationship to the agent framework did. Three architectures. Same voice client. Same agent roster.

Not a Demo. A Pattern.

Voice as Orchestration Layer
Most voice assistants are single-model, single-tool systems. This is a voice interface to a team of AI specialists. The voice model doesn't write code — it coordinates the agents that do.
Intentional Speech
Manual response control means the model is never forced to speak. It can listen to complex, multi-sentence requests. It can think before responding. It speaks when it has something to say.
Async-First Architecture
Tool results arrive whenever they're ready — before, during, or after the model speaks. The system handles all three cases gracefully. No blocking. No dropped results.
Multi-Model by Design
GPT Realtime for voice. Claude for reasoning. Each model does what it's best at. Not a compromise — an architecture.
Full Observability
22 event types streamed live. See every LLM call, every tool execution, every agent fork. Voice AI doesn't have to be a black box.
Rapid Architecture Evolution
Three fundamental pivots in four days. Amplifier's modularity made each change surgical, not seismic. The lesson: good infrastructure enables fearless iteration.

Research Methodology

Data as of: February 20, 2026

Feature status: Active

Research performed:

  • Git log analysis: git log --oneline amplifier-voice (30 commits found)
  • Contributor analysis: git log --format="%an"
  • Date range: extracted from git log timestamps

Gaps: Lines of code and file count not extracted; research doc count (100+) is an estimate from repo structure

Primary contributors: Brian Krabach (29 commits, ~97%), Sam Schillace (1 commit, ~3%)

Talk to your code.

Amplifier Voice proves that voice isn't just a UI layer — it's a fundamentally different way to interact with AI agent teams.

Try It
Clone amplifier-voice. Follow QUICKSTART.md. You'll be talking to agents in under 5 minutes.
Extend It
Add your own agents. Home Assistant integration is already in progress.
Learn From It
100+ research docs in ai-context/. Architecture decisions documented in every commit.
Built by Brian Krabach · Powered by Amplifier · January – February 2026