Amplifier Case Study
Where Did the Time Go?
How Session Insights turned 304 sessions into a performance report — without crashing a single context window.
April 22, 2026
The Challenge
You can't improve what you can't measure
After a full week of building with Amplifier — 304 sessions across 6 projects — the question was simple: Where did all the time go?
The answer was buried inside session event files. The problem? Those files are hostile territory for LLMs.
💥
Context Overflow
Session event files (events.jsonl) contain lines with 100,000+ tokens. A single read_file call can blow out an entire context window.
🌳
Recursive Agent Trees
Sessions spawn child agents which spawn grandchild agents. One session had 54 child agents. Tracing that tree by hand? Impossible.
⏱
Timestamp Math
Computing durations requires pairing events, parsing ISO timestamps, bucketing by category, and handling gaps. That's computation — not inference.
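That pairing-and-parsing work is exactly what a tool should absorb. As a rough illustration of the core duration math, here is a minimal, hypothetical sketch; the event names and fields are assumptions for illustration, not the real events.jsonl schema:

```python
from datetime import datetime

# Hypothetical event records: types and fields are assumptions,
# not the actual Amplifier events.jsonl schema.
events = [
    {"type": "llm:request",  "id": "a1", "ts": "2026-04-15T09:00:00+00:00"},
    {"type": "llm:response", "id": "a1", "ts": "2026-04-15T09:01:30+00:00"},
]

starts: dict[str, datetime] = {}
durations: list[float] = []
for ev in events:
    if ev["type"] == "llm:request":
        # Remember when each request started, keyed by call ID.
        starts[ev["id"]] = datetime.fromisoformat(ev["ts"])
    elif ev["type"] == "llm:response" and ev["id"] in starts:
        # Pair the response with its request and record the duration.
        delta = datetime.fromisoformat(ev["ts"]) - starts.pop(ev["id"])
        durations.append(delta.total_seconds())

print(f"LLM time: {sum(durations):.0f}s across {len(durations)} paired calls")
```

Trivial for a script, expensive for a model: an LLM doing this by inference burns tokens on every timestamp.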
The Prompt
One request. Seven agents. Five parallel analyses.
"Analyze last week's sessions and give me a timing analysis report"
session-analyst → scanned 2,992 session files → found 304 from last week
perf-analyst #1 → Session Browser → 9h 56m active, 1,372 LLM calls
perf-analyst #2 → DTU Setup → 9h 36m active, 1,473 LLM calls
perf-analyst #3 → DTU Acceptance → 20h 18m active, 976 LLM calls
perf-analyst #4 → Grove QA → 4h 15m active, recipe debugging
perf-analyst #5 → Transcript Talk → 10m 16s active, clean run
The Architecture
Two bundles. One mission.
The analysis relied on a handoff between two Amplifier bundles — foundation for session discovery, and session-insights for performance analysis.
Root Agent (claude-opus-4-6)
"Analyze last week's sessions"
Step 1: delegate → foundation:session-analyst
"Find all sessions from April 14-18"
→ scans 2,992 files → returns 304 matches
Step 2: delegate x 5 → session-insights:perf-analyst
Each calls timing_analysis at depth 2
Each returns ~500 token summary
Step 3: synthesize → cross-session performance report
The Secret Weapon
timing_analysis: computation the LLM doesn't have to do
The timing_analysis tool is a purpose-built Python module that does everything an LLM would waste tokens trying to figure out — and does it in milliseconds.
1. Parse — Reads events.jsonl, classifies events by type (LLM, tool, delegation, session lifecycle)
2. Segment — Detects turn boundaries, pairs tool:pre/tool:post by call ID, matches llm:request/llm:response
3. Recurse — Follows delegate:agent_spawned events into child sessions, up to depth 4
4. Compute — Calculates active/idle time, LLM/tool/delegation budgets, bottleneck rankings
5. Compress — Returns a compact structured summary (~2–3K tokens) — no raw event payloads
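To make the Compute step concrete, here is a minimal sketch of duration bucketing and bottleneck ranking. The data shapes are assumed for illustration; the real analyzer.py also handles turn detection, idle gaps, and recursion:

```python
from collections import defaultdict

# Already-paired (category, seconds) durations; the values are made up.
paired = [("llm", 310.0), ("llm", 95.0), ("tool", 12.5), ("delegation", 540.0)]

budgets: defaultdict[str, float] = defaultdict(float)
for category, seconds in paired:
    budgets[category] += seconds

active = sum(budgets.values())
# Rank categories by time consumed: a simple "bottleneck ranking".
for category, seconds in sorted(budgets.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<11}{seconds:7.1f}s  {seconds / active:6.1%}")
```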
The Breakthrough
The token math that makes it work
Here's the fundamental insight: this tool does computationally what would otherwise require LLM tokens. Timestamp math, event correlation, tree traversal, duration bucketing — these are computation problems, not language problems.
Without timing_analysis
read_file("events.jsonl")
grep("llm:response", events.jsonl)
With timing_analysis
timing_analysis(path="/path/to/session")
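This case study doesn't document the summary's exact schema, so treat every key below as an assumption; but a compact result in the spirit described above might look something like this, with values drawn from figures quoted elsewhere in this write-up:

```python
# Hypothetical summary shape; the real keys returned by timing_analysis
# are not shown here, so every field name is an assumption.
summary = {
    "active_time_s": 35760,   # e.g., 9h 56m for the Session Browser session
    "llm_calls": 1372,
    "budgets": {"llm": 0.82, "tool": 0.11, "delegation": 0.07},  # illustrative split
    "bottlenecks": [
        {"turn": 61, "agent": "plan-writer", "duration_s": 9780},  # 2h 43m
    ],
    "children": [],           # recursive child summaries, up to max_depth
}
```

A few kilotokens of structure like this is what the root agent sees, instead of megabytes of raw events.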
By the Numbers
What one prompt actually analyzed
304
Sessions Discovered
Across 6 projects, 1 week
5
Deep Analyses
Parallel, recursive to depth 2
~15K
Tokens in Root
From summaries only — not raw data
44+ hrs
Active Time Profiled
Across all analyzed sessions
5,000+
LLM Calls Tracked
Across the full session trees
7
Agents Coordinated
1 analyst + 5 profilers + 1 root
The Pattern
Context Sink: agents absorb tokens so you don't have to
The multi-agent architecture multiplied the efficiency of the purpose-built tool. Each performance analyst ran in its own context window — absorbing the token cost of its session's analysis locally and returning only a distilled summary to the root.
"The tool does computation. The agent does interpretation. The root does synthesis. Each layer compresses by an order of magnitude."
The Findings
What the analysis actually discovered
Every session told the same story: they were LLM-bound on claude-opus-4-6, with 67–99% of agent time spent waiting for model inference. But the specifics were where it got interesting.
The Runaway Planner
Session Browser — Turn 61
A single plan-writer delegation consumed 2 hours 43 minutes — 27.8% of the entire session's active time. Nearly all LLM inference. Recommendation: use a faster model for planning agents.
The Monster Recipe
DTU Setup — Turn 36
One recipe execution ran for 4 hours 9 minutes — 43% of a 9.6-hour session. Hidden overhead between delegation steps suggested coordination issues in recipe execution.
The Continue Storm
Session Browser — Turns 30–34
The LLM returned 0 output tokens repeatedly across 5 turns, wasting 13 minutes. A real bug — the model was being asked to continue when there was nothing to continue.
The Chatty Browser-Tester
DTU Setup — 7 invocations
Browser testing averaged 72 LLM calls per invocation, consuming 69 minutes total. Each test was a multi-round conversation when it could have been a single assertion.
The Silent Failure
Grove QA
terminal-tester failed 3 out of 3 attempts in a QA session. A real bug in the testing agent that was only surfaced by systematic performance analysis.
Under the Hood
Clean architecture, ruthless simplicity
The session-insights bundle is small by design: one tool, one agent, 71 tests, zero runtime dependencies beyond stdlib. The analysis engine is pure — no file I/O — making it fully testable with synthetic data.
from collections.abc import Callable

def analyze_session(
    events: list[dict],
    metadata: dict | None,
    child_resolver: Callable,
    max_depth: int = 3,
) -> dict:
    """Pure analysis. No file I/O. Fully testable."""
analyzer.py
Pure analysis engine — takes event lists, returns timing trees. No file I/O. All 559 lines are computation: turn detection, time budgets, bottleneck ranking, recursive child analysis.
parser.py
Event classification and pairing utilities. Detects turn boundaries, pairs tool:pre/tool:post by call ID, extracts delegation info, parses timestamps.
resolver.py
All filesystem I/O lives here — isolated by design. Reads events.jsonl, loads metadata.json, resolves child session paths. The only impure module.
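Because all I/O is quarantined in resolver.py, the analyzer can be exercised end to end with synthetic data. A minimal sketch, assuming an import path and a resolver return shape that are not documented here:

```python
# Import path is an assumption based on the module name above.
from analyzer import analyze_session

def null_resolver(child_session_id: str) -> tuple[list[dict], dict | None]:
    """Stand-in for resolver.py that ends recursion immediately.

    The real resolver reads events.jsonl and metadata.json from disk;
    this return shape is assumed for illustration.
    """
    return [], None

synthetic_events = [
    {"type": "llm:request",  "id": "a1", "ts": "2026-04-15T09:00:00+00:00"},
    {"type": "llm:response", "id": "a1", "ts": "2026-04-15T09:01:30+00:00"},
]

report = analyze_session(
    events=synthetic_events,
    metadata=None,
    child_resolver=null_resolver,
    max_depth=2,
)
```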
The Principles
Four Amplifier patterns in action
🔧
Mechanism, Not Policy
The tool provides timing data. The agent provides interpretation. The tool never says "this is slow" — it says "this took 4h 9m, which is 43% of active time." The agent decides what that means.
🕳
Context Sink
Each child agent absorbs the token-heavy work in its own context window and returns a compressed summary. The root session never sees raw event data — only structured insights.
⚡
Tools for Computation
Timestamp math, event correlation, tree traversal, duration bucketing — these are computation problems. The tool handles them in milliseconds. The LLM handles what tools can't: narrative, interpretation, recommendations.
🎯
Ruthless Simplicity
One tool call replaces what would be hundreds of file reads. One agent replaces manual event-log spelunking. The entire bundle is 4 Python files and 1 agent prompt.
The Impact
From "where did the time go?" to actionable recommendations
The user got a comprehensive weekly performance report covering cross-session patterns, specific bottlenecks with exact durations and percentages, and actionable recommendations. Here's what changed:
| Finding | Recommendation | Est. Savings |
| --- | --- | --- |
| Everything LLM-bound on Opus | Use Sonnet for child agents | ~50% faster delegations |
| Plan-writer took 2h 43m | Faster model for planners | ~2 hours/session |
| Browser-tester: 72 calls/run | Batch assertions, fewer rounds | ~1 hour/session |
| Terminal-tester: 3/3 failures | Fix the agent bug | Unblocks QA |
| Recipe overhead gaps | Pre-validate before running | ~30 min/recipe |
"It found a bug I didn't even know existed. The terminal-tester was failing silently every time — only systematic analysis caught it."
The Insight
The ideal division of labor
🔩
Tools Handle
Parsing events.jsonl
Timestamp arithmetic
Event correlation and pairing
Recursive tree traversal
Duration bucketing
Bottleneck ranking
= Deterministic computation
🧠
LLMs Handle
Identifying significance
Narrative construction
Cross-session patterns
Actionable recommendations
Context-aware interpretation
Human-readable reporting
= Judgment and language
The best AI systems don't make the LLM do everything.
They make the LLM do only what LLMs are best at.
Try It Yourself
Get started in 30 seconds
Session Insights is an Amplifier bundle. Add it to your configuration, and every session gains performance analysis capabilities.
includes:
- bundle: git+https://github.com/microsoft/amplifier-bundle-session-insights@main
"Why is my session slow?"
"Break down the timing of my last session"
"Which agents took the most time?"
"Analyze last week's sessions"
Quick Check
"Why is this session slow?" — gets a single-session breakdown with bottleneck rankings and recommendations.
Deep Dive
Set max_depth=4 to follow delegation chains into grandchild agents and see exactly where time goes at every level (see the sketch after these cards).
Weekly Report
Combine with session-analyst to scan a full week of work and get cross-session patterns and optimization opportunities.
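For the deep dive, the call might look like the following; it mirrors the path-based example shown earlier, and the exact calling convention is an assumption:

```python
# Assumed invocation, mirroring the earlier timing_analysis(path=...) example;
# max_depth is the knob described above for following grandchild delegations.
timing_analysis(path="/path/to/session", max_depth=4)
```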
Sources
Methodology
Data as of: April 22, 2026
Feature status: Active
Session analyzed: Real Amplifier session from April 22, 2026. The user asked "Analyze last week's sessions and give me a timing analysis report" and the system executed the multi-agent workflow described in this case study.
Data sources:
- Session event data: ~/.amplifier/projects/*/sessions/*/events.jsonl (304 sessions from April 14–18, 2026)
- Bundle source: amplifier-bundle-session-insights — 4 Python modules, 1 agent, 71 tests
- Foundation bundle: amplifier-foundation — provides session-analyst agent
- Session transcript: parent conversation between user and Amplifier root agent
Metrics derivation:
- "304 sessions" — counted by session-analyst scanning metadata files with date filtering
- "2,992 session files" — total sessions across all projects in ~/.amplifier/projects/
- Active time, LLM calls, tool calls — computed by timing_analysis from event timestamps
- "~15K tokens in root" — estimated from summary lengths returned by child agents
Gaps: Token estimates for child agent contexts are approximate. "Millions of tokens" for raw data is an order-of-magnitude estimate based on typical events.jsonl file sizes.
Bundle author: Amplifier team