Amplifier Case Study
Where Did the Time Go?
How Session Insights turned 304 sessions into a performance report — without crashing a single context window.
April 22, 2026
The Challenge
You can't improve what you can't measure
After a full week of building with Amplifier — 304 sessions across 6 projects — the question was simple: Where did all the time go?
The answer was buried inside session event files. The problem? Those files are hostile territory for LLMs.
💥
Context Overflow
Session event files (events.jsonl) contain lines with 100,000+ tokens. A single read_file call can blow out an entire context window.
🌳
Recursive Agent Trees
Sessions spawn child agents which spawn grandchild agents. One session had 54 child agents. Tracing that tree by hand? Impossible.
⏱
Timestamp Math
Computing durations requires pairing events, parsing ISO timestamps, bucketing by category, and handling gaps. That's computation — not inference.
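That pairing-and-parsing work is exactly what a tool should absorb. As a rough illustration of the core duration math, here is a minimal, hypothetical sketch; the event names and fields are assumptions for illustration, not the real events.jsonl schema:

```python
from datetime import datetime

# Hypothetical event records: types and fields are assumptions,
# not the actual Amplifier events.jsonl schema.
events = [
    {"type": "llm:request",  "id": "a1", "ts": "2026-04-15T09:00:00+00:00"},
    {"type": "llm:response", "id": "a1", "ts": "2026-04-15T09:01:30+00:00"},
]

starts: dict[str, datetime] = {}
durations: list[float] = []
for ev in events:
    if ev["type"] == "llm:request":
        # Remember when each request started, keyed by call ID.
        starts[ev["id"]] = datetime.fromisoformat(ev["ts"])
    elif ev["type"] == "llm:response" and ev["id"] in starts:
        # Pair the response with its request and record the duration.
        delta = datetime.fromisoformat(ev["ts"]) - starts.pop(ev["id"])
        durations.append(delta.total_seconds())

print(f"LLM time: {sum(durations):.0f}s across {len(durations)} paired calls")
```

Trivial for a script, expensive for a model: an LLM doing this by inference burns tokens on every timestamp.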
The Prompt
One request. Seven agents. Five parallel analyses.
"Analyze last week's sessions and give me a timing analysis report"
session-analyst → scanned 2,992 session files → found 304 from last week
perf-analyst #1 → Session Browser → 9h 56m active, 1,372 LLM calls
perf-analyst #2 → DTU Setup → 9h 36m active, 1,473 LLM calls
perf-analyst #3 → DTU Acceptance → 20h 18m active, 976 LLM calls
perf-analyst #4 → Grove QA → 4h 15m active, recipe debugging
perf-analyst #5 → Transcript Talk → 10m 16s active, clean run
The Architecture
Two bundles. One mission.
The analysis relied on a handoff between two Amplifier bundles — foundation for session discovery, and session-insights for performance analysis.
Root Agent (claude-opus-4-6)
"Analyze last week's sessions"
Step 1: delegate → foundation:session-analyst
"Find all sessions from April 14-18"
→ scans 2,992 files → returns 304 matches
Step 2: delegate x 5 → session-insights:perf-analyst
Each calls timing_analysis at depth 2
Each returns ~500 token summary
Step 3: synthesize → cross-session performance report
The Secret Weapon
timing_analysis: computation the LLM doesn't have to do
The timing_analysis tool is a purpose-built Python module that does everything an LLM would waste tokens trying to figure out — and does it in milliseconds.
1. Parse — Reads events.jsonl, classifies events by type (LLM, tool, delegation, session lifecycle)
2. Segment — Detects turn boundaries, pairs tool:pre/tool:post by call ID, matches llm:request/llm:response
3. Recurse — Follows delegate:agent_spawned events into child sessions, up to depth 4
4. Compute — Calculates active/idle time, LLM/tool/delegation budgets, bottleneck rankings
5. Compress — Returns a compact structured summary (~2–3K tokens) — no raw event payloads
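To make the Compute step concrete, here is a minimal sketch of duration bucketing and bottleneck ranking. The data shapes are assumed for illustration; the real analyzer.py also handles turn detection, idle gaps, and recursion:

```python
from collections import defaultdict

# Already-paired (category, seconds) durations; the values are made up.
paired = [("llm", 310.0), ("llm", 95.0), ("tool", 12.5), ("delegation", 540.0)]

budgets: defaultdict[str, float] = defaultdict(float)
for category, seconds in paired:
    budgets[category] += seconds

active = sum(budgets.values())
# Rank categories by time consumed: a simple "bottleneck ranking".
for category, seconds in sorted(budgets.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<11}{seconds:7.1f}s  {seconds / active:6.1%}")
```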
The Breakthrough
The token math that makes it work
Here's the fundamental insight: this tool does computationally what would otherwise require LLM tokens. Timestamp math, event correlation, tree traversal, duration bucketing — these are computation problems, not language problems.
Without timing_analysis
read_file("events.jsonl")
grep("llm:response", events.jsonl)
With timing_analysis
timing_analysis(path="/path/to/session")
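This case study doesn't document the summary's exact schema, so treat every key below as an assumption; but a compact result in the spirit described above might look something like this, with values drawn from figures quoted elsewhere in this write-up:

```python
# Hypothetical summary shape; the real keys returned by timing_analysis
# are not shown here, so every field name is an assumption.
summary = {
    "active_time_s": 35760,   # e.g., 9h 56m for the Session Browser session
    "llm_calls": 1372,
    "budgets": {"llm": 0.82, "tool": 0.11, "delegation": 0.07},  # illustrative split
    "bottlenecks": [
        {"turn": 61, "agent": "plan-writer", "duration_s": 9780},  # 2h 43m
    ],
    "children": [],           # recursive child summaries, up to max_depth
}
```

A few kilotokens of structure like this is what the root agent sees, instead of megabytes of raw events.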
By the Numbers
What one prompt actually analyzed
304
Sessions Discovered
Across 6 projects, 1 week
5
Deep Analyses
Parallel, recursive to depth 2
~15K
Tokens in Root
From summaries only — not raw data
44+ hrs
Active Time Profiled
Across all analyzed sessions
5,000+
LLM Calls Tracked
Across the full session trees
7
Agents Coordinated
1 analyst + 5 profilers + 1 root
The Pattern
Context Sink: agents absorb tokens so you don't have to
The multi-agent architecture multiplied the efficiency of the purpose-built tool. Each performance analyst ran in its own context window — absorbing the token cost of its session's analysis locally and returning only a distilled summary to the root.
"The tool does computation. The agent does interpretation. The root does synthesis. Each layer compresses by an order of magnitude."
The Findings
What the analysis actually discovered
Every session told the same story: they were LLM-bound on claude-opus-4-6, with 67–99% of agent time spent waiting for model inference. But the specifics were where it got interesting.
The Runaway Planner
Session Browser — Turn 61
A single plan-writer delegation consumed 2 hours 43 minutes — 27.8% of the entire session's active time. Nearly all LLM inference. Recommendation: use a faster model for planning agents.
The Monster Recipe
DTU Setup — Turn 36
One recipe execution ran for 4 hours 9 minutes — 43% of a 9.6-hour session. Hidden overhead between delegation steps suggested coordination issues in recipe execution.
The Continue Storm
Session Browser — Turns 30–34
The LLM returned 0 output tokens repeatedly across 5 turns, wasting 13 minutes. A real bug — the model was being asked to continue when there was nothing to continue.
The Chatty Browser-Tester
DTU Setup — 7 invocations
Browser testing averaged 72 LLM calls per invocation, consuming 69 minutes total. Each test was a multi-round conversation when it could have been a single assertion.
The Silent Failure
Grove QA
terminal-tester failed 3 out of 3 attempts in a QA session. A real bug in the testing agent that was only surfaced by systematic performance analysis.
Under the Hood
Clean architecture, ruthless simplicity
The session-insights bundle is small by design: one tool, one agent, 71 tests, zero runtime dependencies beyond stdlib. The analysis engine is pure — no file I/O — making it fully testable with synthetic data.
from collections.abc import Callable

def analyze_session(
    events: list[dict],
    metadata: dict | None,
    child_resolver: Callable,
    max_depth: int = 3,
) -> dict:
    """Pure analysis. No file I/O. Fully testable."""
analyzer.py
Pure analysis engine — takes event lists, returns timing trees. No file I/O. All 559 lines are computation: turn detection, time budgets, bottleneck ranking, recursive child analysis.
parser.py
Event classification and pairing utilities. Detects turn boundaries, pairs tool:pre/tool:post by call ID, extracts delegation info, parses timestamps.
resolver.py
All filesystem I/O lives here — isolated by design. Reads events.jsonl, loads metadata.json, resolves child session paths. The only impure module.
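Because all I/O is quarantined in resolver.py, the analyzer can be exercised end to end with synthetic data. A minimal sketch, assuming an import path and a resolver return shape that are not documented here:

```python
# Import path is an assumption based on the module name above.
from analyzer import analyze_session

def null_resolver(child_session_id: str) -> tuple[list[dict], dict | None]:
    """Stand-in for resolver.py that ends recursion immediately.

    The real resolver reads events.jsonl and metadata.json from disk;
    this return shape is assumed for illustration.
    """
    return [], None

synthetic_events = [
    {"type": "llm:request",  "id": "a1", "ts": "2026-04-15T09:00:00+00:00"},
    {"type": "llm:response", "id": "a1", "ts": "2026-04-15T09:01:30+00:00"},
]

report = analyze_session(
    events=synthetic_events,
    metadata=None,
    child_resolver=null_resolver,
    max_depth=2,
)
```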
The Principles
Four Amplifier patterns in action
🔧
Mechanism, Not Policy
The tool provides timing data. The agent provides interpretation. The tool never says "this is slow" — it says "this took 4h 9m, which is 43% of active time." The agent decides what that means.
🕳
Context Sink
Each child agent absorbs the token-heavy work in its own context window and returns a compressed summary. The root session never sees raw event data — only structured insights.
⚡
Tools for Computation
Timestamp math, event correlation, tree traversal, duration bucketing — these are computation problems. The tool handles them in milliseconds. The LLM handles what tools can't: narrative, interpretation, recommendations.
🎯
Ruthless Simplicity
One tool call replaces what would be hundreds of file reads. One agent replaces manual event-log spelunking. The entire bundle is 4 Python files and 1 agent prompt.
The Impact
From "where did the time go?" to actionable recommendations
The user got a comprehensive weekly performance report covering cross-session patterns, specific bottlenecks with exact durations and percentages, and actionable recommendations. Here's what changed:
| Finding | Recommendation | Est. Savings |
| --- | --- | --- |
| Everything LLM-bound on Opus | Use Sonnet for child agents | ~50% faster delegations |
| Plan-writer took 2h 43m | Faster model for planners | ~2 hours/session |
| Browser-tester: 72 calls/run | Batch assertions, fewer rounds | ~1 hour/session |
| Terminal-tester: 3/3 failures | Fix the agent bug | Unblocks QA |
| Recipe overhead gaps | Pre-validate before running | ~30 min/recipe |
"It found a bug I didn't even know existed. The terminal-tester was failing silently every time — only systematic analysis caught it."
The Insight
The ideal division of labor
🔩
Tools Handle
Parsing events.jsonl
Timestamp arithmetic
Event correlation and pairing
Recursive tree traversal
Duration bucketing
Bottleneck ranking
= Deterministic computation
🧠
LLMs Handle
Identifying significance
Narrative construction
Cross-session patterns
Actionable recommendations
Context-aware interpretation
Human-readable reporting
= Judgment and language
The best AI systems don't make the LLM do everything.
They make the LLM do only what LLMs are best at.
Try It Yourself
Get started in 30 seconds
Session Insights is an Amplifier bundle. Add it to your configuration, and every session gains performance analysis capabilities.
includes:
- bundle: git+https://github.com/microsoft/amplifier-bundle-session-insights@main
"Why is my session slow?"
"Break down the timing of my last session"
"Which agents took the most time?"
"Analyze last week's sessions"
Quick Check
"Why is this session slow?" — gets a single-session breakdown with bottleneck rankings and recommendations.
Deep Dive
Set max_depth=4 to follow delegation chains into grandchild agents and see exactly where time goes at every level (see the sketch after these cards).
Weekly Report
Combine with session-analyst to scan a full week of work and get cross-session patterns and optimization opportunities.
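For the deep dive, the call might look like the following; it mirrors the path-based example shown earlier, and the exact calling convention is an assumption:

```python
# Assumed invocation, mirroring the earlier timing_analysis(path=...) example;
# max_depth is the knob described above for following grandchild delegations.
timing_analysis(path="/path/to/session", max_depth=4)
```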
Sources
Methodology
Data as of: April 22, 2026
Feature status: Active
Session analyzed: Real Amplifier session from April 22, 2026. The user asked "Analyze last week's sessions and give me a timing analysis report" and the system executed the multi-agent workflow described in this case study.
Data sources:
- Session event data: ~/.amplifier/projects/*/sessions/*/events.jsonl (304 sessions from April 14–18, 2026)
- Bundle source: amplifier-bundle-session-insights — 4 Python modules, 1 agent, 71 tests
- Foundation bundle: amplifier-foundation — provides session-analyst agent
- Session transcript: parent conversation between user and Amplifier root agent
Metrics derivation:
- "304 sessions" — counted by session-analyst scanning metadata files with date filtering
- "2,992 session files" — total sessions across all projects in ~/.amplifier/projects/
- Active time, LLM calls, tool calls — computed by timing_analysis from event timestamps
- "~15K tokens in root" — estimated from summary lengths returned by child agents
Gaps: Token estimates for child agent contexts are approximate. "Millions of tokens" for raw data is an order-of-magnitude estimate based on typical events.jsonl file sizes.
Bundle author: Amplifier team