Amplifier Case Study

Where Did the Time Go?

How Session Insights turned 304 sessions into a performance report — without crashing a single context window.

April 22, 2026
The Challenge

You can't improve what you can't measure

After a full week of building with Amplifier — 304 sessions across 6 projects — the question was simple: Where did all the time go?

The answer was buried inside session event files. The problem? Those files are hostile territory for LLMs.

💥
Context Overflow
Session event files (events.jsonl) contain lines with 100,000+ tokens. A single read_file call can blow out an entire context window.
🌳
Recursive Agent Trees
Sessions spawn child agents which spawn grandchild agents. One session had 54 child agents. Tracing that tree by hand? Impossible.
Timestamp Math
Computing durations requires pairing events, parsing ISO timestamps, bucketing by category, and handling gaps. That's computation — not inference.
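
For a sense of what "computation, not inference" means here, the core operation is tiny in code. A minimal sketch, assuming illustrative call_id and timestamp field names rather than the actual event schema:

# The basic unit of timing analysis: pair two events by call ID and do the
# timestamp arithmetic. Field names are illustrative, not the real schema.
from datetime import datetime

pre  = {"event": "tool:pre",  "call_id": "c-42", "timestamp": "2026-04-16T14:02:11.532+00:00"}
post = {"event": "tool:post", "call_id": "c-42", "timestamp": "2026-04-16T14:02:19.904+00:00"}

def duration_seconds(pre_event: dict, post_event: dict) -> float:
    """Elapsed wall-clock time between a paired pre/post event."""
    assert pre_event["call_id"] == post_event["call_id"]
    start = datetime.fromisoformat(pre_event["timestamp"])
    end = datetime.fromisoformat(post_event["timestamp"])
    return (end - start).total_seconds()

print(duration_seconds(pre, post))   # 8.372 seconds; trivial in code, costly as LLM inference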
The Prompt

One request. Seven agents. Five parallel analyses.

# The user typed:
"Analyze last week's sessions and give me a timing analysis report"

# What Amplifier actually did:
session-analyst  → scanned 2,992 session files → found 304 from last week
perf-analyst #1  → Session Browser  → 9h 56m active, 1,372 LLM calls
perf-analyst #2  → DTU Setup        → 9h 36m active, 1,473 LLM calls
perf-analyst #3  → DTU Acceptance   → 20.3 hours active, 976 LLM calls
perf-analyst #4  → Grove QA         → 4h 15m active, recipe debugging
perf-analyst #5  → Transcript Talk  → 10m 16s active, clean run
The Architecture

Two bundles. One mission.

The analysis relied on a handoff between two Amplifier bundles — foundation for session discovery, and session-insights for performance analysis.

+---------------------------------------------------------+
| Root Agent (claude-opus-4-6)                             |
| "Analyze last week's sessions"                           |
|                                                          |
| Step 1: delegate → foundation:session-analyst            |
|         "Find all sessions from April 14-18"             |
|         → scans 2,992 files → returns 304 matches        |
|                                                          |
| Step 2: delegate x 5 → session-insights:perf-analyst     |
|         Each calls timing_analysis at depth 2            |
|         Each returns ~500 token summary                  |
|                                                          |
| Step 3: synthesize → cross-session performance report    |
+---------------------------------------------------------+
The Secret Weapon

timing_analysis: computation the LLM doesn't have to do

The timing_analysis tool is a purpose-built Python module that does everything an LLM would waste tokens trying to figure out — and does it in milliseconds.

1. Parse — Reads events.jsonl, classifies events by type (LLM, tool, delegation, session lifecycle)
2. Segment — Detects turn boundaries, pairs tool:pre/tool:post by call ID, matches llm:request/llm:response
3. Recurse — Follows delegate:agent_spawned events into child sessions, up to depth 4
4. Compute — Calculates active/idle time, LLM/tool/delegation budgets, bottleneck rankings
5. Compress — Returns a compact structured summary (~2-3K tokens) — no raw event payloads
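
As a rough illustration of how those five stages can compose, here is a toy sketch over a simplified event schema. The event names (llm:request, tool:pre, delegate:agent_spawned) come from the steps above, but field names such as call_id and child_session_id are assumptions, and the real analyzer.py handles far more: turn detection, idle time, gaps, and budgets.

# Toy sketch of the five stages over an assumed, simplified event schema.
# Not the actual analyzer.py; field names like "call_id", "timestamp", and
# "child_session_id" are illustrative assumptions.
import json
from datetime import datetime
from typing import Callable, Optional

def _ts(event: dict) -> datetime:
    return datetime.fromisoformat(event["timestamp"])

def analyze(lines: list[str],
            child_resolver: Callable[[str], Optional[list[str]]],
            depth: int = 0, max_depth: int = 4) -> dict:
    events = [json.loads(line) for line in lines if line.strip()]    # 1. Parse
    llm_durations, tool_durations, children = [], [], []
    open_calls: dict[str, dict] = {}
    for ev in events:                                                # 2. Segment: pair by call ID
        kind = ev["event"]
        if kind in ("llm:request", "tool:pre"):
            open_calls[ev["call_id"]] = ev
        elif kind in ("llm:response", "tool:post"):
            start = open_calls.pop(ev["call_id"], None)
            if start is not None:
                seconds = (_ts(ev) - _ts(start)).total_seconds()
                (llm_durations if kind == "llm:response" else tool_durations).append(seconds)
        elif kind == "delegate:agent_spawned" and depth < max_depth: # 3. Recurse into children
            child_lines = child_resolver(ev["child_session_id"])
            if child_lines:
                children.append(analyze(child_lines, child_resolver, depth + 1, max_depth))
    return {                                                         # 4. Compute + 5. Compress
        "llm_seconds": round(sum(llm_durations), 1),
        "tool_seconds": round(sum(tool_durations), 1),
        "llm_calls": len(llm_durations),
        "children": children,   # compact recursive tree, no raw payloads
    }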
The Breakthrough

The token math that makes it work

Here's the fundamental insight: the tool does as computation what would otherwise require LLM tokens. Timestamp math, event correlation, tree traversal, duration bucketing — these are computation problems, not language problems.

Without timing_analysis
# Naive approach: read event files directly
read_file("events.jsonl")   # 100K+ tokens per file
                            # x 5 sessions = 500K+ tokens
                            # x recursive child sessions = millions
# Result: context overflow, session crash

# Alternative: grep for patterns
grep("llm:response", events.jsonl)
# Gets text matches, not timing data
# Can't compute durations from grep
# No tree structure, no budgets
With timing_analysis
# One tool call per session
timing_analysis(path="/path/to/session")
# Returns:  ~2-3K tokens
# Contains: full timing tree
# Includes: recursive child analysis
# Budgeted: for LLM context

# The agent interprets structure, not raw data.
# Clean division of labor.
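
For a sense of what that compact summary contains, its shape might look roughly like this. The field names and placeholder values are illustrative assumptions; only the categories (active/idle time, LLM/tool/delegation budgets, bottlenecks, recursive children) come from the tool's description, and the concrete numbers echo the Session Browser findings later in this report.

# Hypothetical shape of a timing_analysis summary. Field names are
# illustrative assumptions; the categories mirror what the tool computes.
summary = {
    "active_time": "9h 56m",            # from the Session Browser analysis
    "idle_time": "...",
    "budgets": {"llm": "...", "tool": "...", "delegation": "..."},
    "llm_calls": 1372,
    "bottlenecks": [
        {"turn": 61, "agent": "plan-writer",
         "duration": "2h 43m", "share_of_active": "27.8%"},
    ],
    "children": [
        # recursive child-session summaries with the same shape, up to max_depth
    ],
}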
By the Numbers

What one prompt actually analyzed

304 Sessions Discovered (across 6 projects, 1 week)
5 Deep Analyses (parallel, recursive to depth 2)
~15K Tokens in Root (from summaries only, not raw data)
44+ hrs Active Time Profiled (across all analyzed sessions)
5,000+ LLM Calls Tracked (across the full session trees)
7 Agents Coordinated (1 analyst + 5 profilers + 1 root)
The Pattern

Context Sink: agents absorb tokens so you don't have to

The multi-agent architecture multiplied the efficiency of the purpose-built tool. Each performance analyst ran in its own context window — absorbing the token cost of its session's analysis locally and returning only a distilled summary to the root.

Raw Data (millions of tokens) → Tool Output (~12K tokens, 5 sessions) → Agent Reports (~5K tokens, summaries) → Root Context (~15K total)
"The tool does computation. The agent does interpretation. The root does synthesis. Each layer compresses by an order of magnitude."
The Findings

What the analysis actually discovered

Every session told the same story: each was LLM-bound on claude-opus-4-6, with 67–99% of agent time spent waiting for model inference. But the specifics were where it got interesting.

The Runaway Planner
Session Browser — Turn 61
A single plan-writer delegation consumed 2 hours 43 minutes — 27.8% of the entire session's active time. Nearly all LLM inference. Recommendation: use a faster model for planning agents.
The Monster Recipe
DTU Setup — Turn 36
One recipe execution ran for 4 hours 9 minutes — 43% of a 9.6-hour session. Hidden overhead between delegation steps suggested coordination issues in recipe execution.
The Continue Storm
Session Browser — Turns 30–34
The LLM returned 0 output tokens repeatedly across 5 turns, wasting 13 minutes. A real bug — the model was being asked to continue when there was nothing to continue.
The Chatty Browser-Tester
DTU Setup — 7 invocations
Browser testing averaged 72 LLM calls per invocation, consuming 69 minutes total. Each test was a multi-round conversation when it could have been a single assertion.
The Silent Failure
Grove QA
terminal-tester failed 3 out of 3 attempts in a QA session. A real bug in the testing agent that was only surfaced by systematic performance analysis.
Under the Hood

Clean architecture, ruthless simplicity

The session-insights bundle is small by design: one tool, one agent, 71 tests, zero runtime dependencies beyond stdlib. The analysis engine is pure — no file I/O — making it fully testable with synthetic data.

# The engine is injected with a resolver, not coupled to the filesystem
from typing import Callable

def analyze_session(
    events: list[dict],         # Already parsed from JSONL
    metadata: dict | None,      # Session metadata
    child_resolver: Callable,   # (session_id) -> events | None
    max_depth: int = 3,         # Recursion limit (max 4)
) -> dict:                      # -> Pre-computed timing tree
    """Pure analysis. No file I/O. Fully testable."""
analyzer.py
Pure analysis engine — takes event lists, returns timing trees. No file I/O. All 559 lines are computation: turn detection, time budgets, bottleneck ranking, recursive child analysis.
parser.py
Event classification and pairing utilities. Detects turn boundaries, pairs tool:pre/tool:post by call ID, extracts delegation info, parses timestamps.
resolver.py
All filesystem I/O lives here — isolated by design. Reads events.jsonl, loads metadata.json, resolves child session paths. The only impure module.
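
Because the engine takes already-parsed event lists and an injected resolver, it can be exercised with synthetic data and no filesystem at all. A minimal sketch, assuming the module path and a simplified event schema:

# Sketch of testing the pure engine with synthetic events and a stub resolver.
# The import path and event fields are assumptions for illustration.
from session_insights.analyzer import analyze_session  # assumed module path

synthetic_events = [
    {"event": "llm:request",  "call_id": "c1", "timestamp": "2026-04-16T10:00:00+00:00"},
    {"event": "llm:response", "call_id": "c1", "timestamp": "2026-04-16T10:00:42+00:00"},
]

def no_children(session_id: str) -> None:
    """Stub resolver: pretend no child sessions exist."""
    return None

tree = analyze_session(
    events=synthetic_events,
    metadata=None,
    child_resolver=no_children,
    max_depth=1,
)
# The result is a pre-computed timing tree; a test can assert on it directly
# (e.g. that LLM time totals 42 seconds) without touching events.jsonl.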
The Principles

Four Amplifier patterns in action

🔧
Mechanism, Not Policy
The tool provides timing data. The agent provides interpretation. The tool never says "this is slow" — it says "this took 4h9m, which is 43% of active time." The agent decides what that means (a sketch of this style of reporting follows these principles).
🕳
Context Sink
Each child agent absorbs the token-heavy work in its own context window and returns a compressed summary. The root session never sees raw event data — only structured insights.
Tools for Computation
Timestamp math, event correlation, tree traversal, duration bucketing — these are computation problems. The tool handles them in milliseconds. The LLM handles what tools can't: narrative, interpretation, recommendations.
🎯
Ruthless Simplicity
One tool call replaces what would be hundreds of file reads. One agent replaces manual event-log spelunking. The entire bundle is 4 Python files and 1 agent prompt.
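
To make "mechanism, not policy" concrete, here is a minimal sketch of the reporting style the tool aims for: durations and shares of active time, with no judgments attached. The function and field names are illustrative, not the bundle's API; the numbers echo the Monster Recipe finding above.

# Mechanism, not policy: report durations and shares of active time; never
# label anything "slow". Function and field names are illustrative only.
def rank_bottlenecks(spans: list[dict], active_seconds: float, top: int = 3) -> list[str]:
    """Sort time spans by duration and express each as a share of active time."""
    ranked = sorted(spans, key=lambda s: s["seconds"], reverse=True)[:top]
    return [
        f'{s["name"]}: {int(s["seconds"] // 3600)}h{int(s["seconds"] % 3600 // 60):02d}m '
        f'({s["seconds"] / active_seconds:.0%} of active time)'
        for s in ranked
    ]

# Numbers from the DTU Setup finding: a 4h 9m recipe execution in a 9.6-hour session.
spans = [{"name": "recipe execution (turn 36)", "seconds": 4 * 3600 + 9 * 60}]
print(rank_bottlenecks(spans, active_seconds=9.6 * 3600))
# ['recipe execution (turn 36): 4h09m (43% of active time)']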
The Impact

From "where did the time go?" to actionable recommendations

The user got a comprehensive weekly performance report covering cross-session patterns, specific bottlenecks with exact durations and percentages, and actionable recommendations. Here's what changed:

Finding                       | Recommendation                 | Est. Savings
------------------------------+--------------------------------+-------------------------
Everything LLM-bound on Opus  | Use Sonnet for child agents    | ~50% faster delegations
Plan-writer took 2h43m        | Faster model for planners      | ~2 hours/session
Browser-tester: 72 calls/run  | Batch assertions, fewer rounds | ~1 hour/session
Terminal-tester: 3/3 failures | Fix the agent bug              | Unblocks QA
Recipe overhead gaps          | Pre-validate before running    | ~30 min/recipe
"It found a bug I didn't even know existed. The terminal-tester was failing silently every time — only systematic analysis caught it."
The Insight

The ideal division of labor

🔩
Tools Handle
Parsing events.jsonl
Timestamp arithmetic
Event correlation and pairing
Recursive tree traversal
Duration bucketing
Bottleneck ranking
= Deterministic computation
🧠
LLMs Handle
Identifying significance
Narrative construction
Cross-session patterns
Actionable recommendations
Context-aware interpretation
Human-readable reporting
= Judgment and language

The best AI systems don't make the LLM do everything.
They make the LLM do only what LLMs are best at.

Try It Yourself

Get started in 30 seconds

Session Insights is an Amplifier bundle. Add it to your configuration, and every session gains performance analysis capabilities.

# 1. Include the bundle in your Amplifier config
includes:
  - bundle: git+https://github.com/microsoft/amplifier-bundle-session-insights@main

# 2. Ask any performance question
"Why is my session slow?"
"Break down the timing of my last session"
"Which agents took the most time?"
"Analyze last week's sessions"

# 3. The system automatically delegates to the right agents
#    Root -> session-performance-analyst -> timing_analysis tool

# No manual event parsing. No context overflow. Just answers.
Quick Check
"Why is this session slow?" — gets a single-session breakdown with bottleneck rankings and recommendations.
Deep Dive
Set max_depth=4 to follow delegation chains into grandchild agents. See exactly where time goes at every level.
Weekly Report
Combine with session-analyst to scan a full week of work and get cross-session patterns and optimization opportunities.
Sources

Methodology

Data as of: April 22, 2026

Feature status: Active

Session analyzed: Real Amplifier session from April 22, 2026. The user asked "Analyze last week's sessions and give me a timing analysis report" and the system executed the multi-agent workflow described in this case study.

Data sources:

Metrics derivation:

Gaps: Token estimates for child agent contexts are approximate. "Millions of tokens" for raw data is an order-of-magnitude estimate based on typical events.jsonl file sizes.

Bundle author: Amplifier team

More Amplifier Stories