Case Study

Making LLMs Reliable
Through Code

A debugging journey: 5 recipe versions, 3 bugs found, 1 fundamental insight about AI-assisted development.

February 2026
The Challenge

Preserve What Works

When regenerating documentation, users want to keep their existing content intact - only updating sections that need changes.

The Ask

"When I provide an existing document, preserve sections that don't need changes. Only regenerate what's actually outdated."

The Stakes

11 carefully written sections with code examples, commands, and formatting - losing any of it means manual recovery.

"Mostly, we want to make sure the existing sections reflect the source files. The existing doc is a reference for what is already mainly working."
v7.1.0

First Attempt: Let the LLM Decide

```
# The prompt approach
If action_needed is "none":   USE EXISTING CONTENT AS-IS
If action_needed is "add":    Keep existing, ADD missing content
If action_needed is "update": Revise specific parts only
```
🚨

Result: Content Mangled

The LLM "preserved" content by summarizing it. Code blocks were paraphrased. Commands were reformatted. Nothing was verbatim.

Lesson: Asking an LLM to "copy exactly" doesn't mean it will.

v7.2.0

Fix #1: Use Code to Extract Content

Don't ask the LLM to preserve - use bash/Python to copy sections directly.

```bash
# Deterministic preservation with bash -- no LLM involved, pure data copy
for section_id in $(jq -r '.section_mappings | to_entries[]
    | select(.value.action_needed == "none") | .key' tracker.json); do
  # Copy existing content directly into the tracker
  jq --arg id "$section_id" \
     '.generated_content[$id] = .section_mappings[$id].existing_content' \
     tracker.json > tracker.tmp && mv tracker.tmp tracker.json
done
```
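The same deterministic copy can be sketched in Python. This is a minimal sketch, not the recipe's actual code; the tracker layout (`section_mappings`, `action_needed`, `existing_content`, `generated_content`) mirrors the bash snippet above and is an assumption.

```python
import json

def copy_preserved_sections(tracker_path):
    """Move sections marked action_needed == "none" straight into
    generated_content. Pure data copy -- no LLM in the loop."""
    with open(tracker_path) as f:
        tracker = json.load(f)
    generated = tracker.setdefault("generated_content", {})
    for section_id, mapping in tracker["section_mappings"].items():
        if mapping["action_needed"] == "none":
            generated[section_id] = mapping["existing_content"]
    with open(tracker_path, "w") as f:
        json.dump(tracker, f, indent=2)
```

Because no model is involved, the copied bytes are guaranteed identical to the input.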
Before

Original: 6,293 bytes

Output: 5,340 bytes

❌ Lost ~950 bytes somewhere

Still Broken

Content extraction was truncating sections. But why?

Time to investigate...

Discovery

The Markdown Parser Bug

🐛

Code Comments Mistaken for Headings

This bash comment inside a code block...

```bash
amplifier run "execute recipe.yaml"

# Interactive mode   <-- Parser sees this as a heading!
amplifier
```

The markdown parser wasn't tracking whether it was inside a code fence. Every # comment was treated as a new section boundary, truncating content.

Section 1.2.1 content:
Before fix: 99 characters (truncated)
After fix: 737 characters (complete)

v7.5.0

Fix #2: Code Block State Tracking

```python
def parse_markdown_sections(content):
    sections = []
    in_code_block = False  # Track state!
    for line in content.split('\n'):
        if line.startswith('```'):
            in_code_block = not in_code_block
            continue
        # Only treat '#' as a heading if NOT in a code block
        if line.startswith('#') and not in_code_block:
            sections.append(parse_heading(line))
    return sections
```
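The fix can be exercised directly on the problematic snippet. This is a self-contained sketch: `parse_heading` is stubbed to return the heading text, and the fenced block is built programmatically only to avoid nesting literal fences in this example.

```python
def parse_heading(line):
    # Minimal stub: strip leading '#' markers and surrounding whitespace
    return line.lstrip('#').strip()

def parse_markdown_sections(content):
    sections = []
    in_code_block = False  # Track fence state
    for line in content.split('\n'):
        if line.startswith('```'):
            in_code_block = not in_code_block
            continue
        if line.startswith('#') and not in_code_block:
            sections.append(parse_heading(line))
    return sections

fence = "`" * 3  # avoids a literal fence inside this example
doc = "\n".join([
    "# Section 1.2.1",
    fence + "bash",
    "# Interactive mode  <-- a comment, not a heading",
    "amplifier",
    fence,
    "## Section 1.2.2",
])
print(parse_markdown_sections(doc))  # ['Section 1.2.1', 'Section 1.2.2']
```

Without the `in_code_block` check, the bash comment would appear as a third section and everything after it would be attributed to the wrong boundary.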
Result

Original: 6,293 bytes

Output: 7,211 bytes

✅ All content preserved + new sections added

Diff Check

Only 2 additions (intro paragraphs for empty parent sections).

Zero modifications to preserved content!
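A diff check like this can be scripted with Python's standard-library `difflib`. A minimal sketch: preserved content is intact when the diff shows additions only.

```python
import difflib

def summarize_diff(original, regenerated):
    """Count lines added and removed between two document versions.
    Pure additions (removed == 0) mean preserved content survived."""
    diff = list(difflib.ndiff(original.splitlines(), regenerated.splitlines()))
    added = sum(1 for line in diff if line.startswith('+ '))
    removed = sum(1 for line in diff if line.startswith('- '))
    return added, removed

# Example: regenerated doc keeps every original line and adds one intro
before = "## Setup\nrun the installer\n## Usage\ncall the CLI"
after = "## Setup\nrun the installer\n## Usage\nA short intro.\ncall the CLI"
print(summarize_diff(before, after))  # (1, 0) -> one addition, nothing removed
```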

Plot Twist

But Then... Validation Broke It

Generate → Content Validation → Fix Issues → Quality Validation → Fix Issues
🤯

The "Skip These Sections" Problem

We told the LLM: "These sections are preserved - don't modify them during validation fixes."

The LLM: "Sure!" *proceeds to rewrite them anyway*

💡

Root Cause Identified

LLMs cannot reliably follow "skip these sections" instructions. They will still touch, modify, or "improve" content they were told to leave alone.

v7.4.0

Final Fix: Deterministic Restore

Don't trust the LLM to skip. Instead: let it do its thing, then restore preserved sections with code.

LLM Validation Fix → Python Restore → LLM Quality Fix → Python Restore
```python
# After EVERY LLM validation pass, restore preserved sections
def restore_preserved_sections(document, preserved_content):
    """Deterministically restore sections - no LLM involved."""
    for section_id, original in preserved_content.items():
        # Find section boundaries using heading matching
        start, end = find_section_boundaries(document, section_id)
        # Replace whatever the LLM wrote with the original
        document = document[:start] + original + document[end:]
    return document  # Guaranteed byte-for-byte identical
```
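The heading-matching helper is where the real work happens. A minimal sketch of one way to implement it, assuming sections are identified by their heading line; this is not the recipe's actual code, and a production version would reuse the fence-aware parser so headings inside code blocks are ignored.

```python
def find_section_boundaries(document, heading):
    """Return (start, end) character offsets of the section that starts at
    `heading` and runs to the next heading of the same or higher level."""
    level = len(heading) - len(heading.lstrip('#'))
    start = None
    pos = 0
    for line in document.split('\n'):
        if start is None and line.strip() == heading:
            start = pos
        elif start is not None and line.startswith('#') and \
                (len(line) - len(line.lstrip('#'))) <= level:
            return start, pos  # next same-or-higher heading ends the section
        pos += len(line) + 1   # +1 for the newline
    if start is None:
        raise ValueError(f"heading not found: {heading}")
    return start, len(document)  # section runs to end of document
```

Usage: `find_section_boundaries(doc, "## Setup")` gives the exact slice to overwrite, so the restore is a plain string splice with no model judgment involved.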
The Pattern

Trust But Verify

❌ Don't Do This
```
# Hoping the LLM follows instructions
prompt: |
  Fix content issues but DO NOT modify preserved sections
# Result: LLM modifies them anyway
```
✅ Do This Instead
```
# Let LLM work, then restore with code
steps:
  - llm_validation_fix
  - python_restore_preserved
  - llm_quality_fix
  - python_restore_preserved
# Result: 100% fidelity guaranteed
```
"When determinism matters, use code, not LLMs."
The Journey

5 Versions in One Session

v7.1.0
Existing Document Input
LLM-based preservation with quality assessment (keep/revise/replace). Failed: LLM rewrote "preserved" content.
v7.2.0
Deterministic Copy
Bash copies sections with action_needed="none" directly. Partial: Content extraction truncated.
v7.3.0
Skip Validation for Preserved
Analysis determines what needs work, validation skips preserved. Partial: LLM still touched preserved sections.
v7.4.0
Deterministic Restore
Python restores preserved sections after each LLM pass. Success: 100% preservation.
v7.5.0
Code Block Aware Parsing
Fixed # comments in code blocks being seen as headings. Complete solution achieved.
Results

100% Fidelity

11
Preserved Sections
0
Bytes Modified
5
Recipe Versions

Validated by recipe-results-validator:
All 11 preserved sections are byte-for-byte identical to the original.

Analysis via Amplifier session-analyst and recipe-results-validator
Key Insight
💡

When Determinism Matters,
Use Code, Not LLMs

LLMs are powerful for generation and analysis. But for tasks requiring exact reproduction, byte-level accuracy, or strict constraints - use deterministic code.

Takeaways

Patterns for Reliable AI Workflows

1. Sandwich Pattern

Wrap LLM operations with deterministic code. Pre-process inputs, post-process outputs.
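The sandwich shape can be sketched in a few lines. Everything here is illustrative: the step names are hypothetical and the "LLM" is stubbed with an uppercasing function so the example runs on its own.

```python
def sandwich(document, llm_step, extract_preserved, restore_preserved):
    preserved = extract_preserved(document)      # code: pull out what must not change
    draft = llm_step(document)                   # LLM: generate / fix freely
    return restore_preserved(draft, preserved)   # code: put protected content back

# Stubbed demo: the "LLM" rewrites everything, but the sandwich
# deterministically restores the protected first line afterwards.
extract = lambda doc: doc.split('\n', 1)[0]
restore = lambda draft, first: first + '\n' + draft.split('\n', 1)[1]
fake_llm = lambda doc: doc.upper()

out = sandwich("keep me\nfix me", fake_llm, extract, restore)
print(out)  # keep me\nFIX ME
```

The point of the pattern is that the outer layers are plain code, so their guarantees hold no matter what the middle layer does.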

2. Restore After

Don't ask LLMs to skip things. Let them work, then restore what shouldn't change.

3. State Tracking

Keep track of context (like in_code_block) that changes how content should be parsed.

4. Byte-Level Validation

Don't trust "looks right" - verify with checksums, diffs, or exact byte comparisons.
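One cheap way to do this is hashing each preserved section before and after the pipeline. A minimal sketch using the standard-library `hashlib`; any rewrite, even a paraphrase that "looks right", changes the digest.

```python
import hashlib

def section_checksum(text):
    """Hash a section's exact bytes; identical digests mean
    byte-for-byte identical content."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

original    = "run the installer\nthen reboot"
preserved   = "run the installer\nthen reboot"            # untouched copy
paraphrased = "run the installer, then reboot the machine" # "improved" copy

assert section_checksum(original) == section_checksum(preserved)
assert section_checksum(original) != section_checksum(paraphrased)
```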

5. Iterative Debugging

Each fix reveals new issues. Budget time for multiple passes (5 versions in this case).

6. Code Over Prompts

When determinism matters, a 10-line Python function beats a 100-word prompt.

Sources & Methodology

About This Case Study

This deck documents a debugging session working on the document-generation-parallel.yaml recipe within the Amplifier ecosystem.

Data as of February 20, 2026

Apply This Pattern

Trust But Verify

The next time you need an LLM to "preserve" or "skip" something, consider: can you enforce that with code instead?

Recipe Source

document-generation-parallel.yaml v7.5.0

Session Analysis

Available via session-analyst agent

The debugging session that inspired this deck:
~4 hours of iterative problem-solving
