Alpha v0.1.0 Active Development

engram-lite

Persistent, vector-backed memory for Claude Code and Amplifier agents.
Local SQLite. Dual-route retrieval. Silent by design.

What if your AI coding agent remembered you
the way a great colleague does?

Python 3.11+ MIT License SQLite + sqlite-vec fastembed / ONNX

Ken Chau · Microsoft · March 2026

The Repetition Tax

"Human collaborators build shared understanding
over time. Without memory, AI agents are
perpetual strangers."

01  Re-Explanation

"We use tabs, not spaces." "The API runs on port 8432." "We decided against Redis because…" Stated once. Restated every session.

02  Lost Knowledge

Decisions made three months ago — why you chose SQLite over Postgres, why the auth module uses that pattern — evaporate when the session closes.

03  No Relationship

A developer who has told their AI "I prefer composition over inheritance" twenty times eventually stops correcting it. The AI cannot learn.

Source: engram-lite PRD §2 — Problem Statement (2026-03-03)

Two layers.
One memory.

Pipeline RAG is wrong for this problem. The hot stuff should always be there. The deep stuff should be retrieved on demand.

Layer 1: MEMORY.md

The Hot Surface

Prose narrative injected at every session start. Always present. No retrieval latency. The agent reads it like a colleague reading their notes before a meeting.

Always injected Prose, not JSON Agent-authored

Layer 2: Vector + Graph DB

The Deep Store

SQLite + sqlite-vec + FTS5. Dual-route retrieval: fast vector KNN fused with BM25, plus hierarchical graph traversal for broad queries. Retrieved on demand via tool calls.

On-demand recall Dual-route Graph + Vector
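The BM25 half of the fast path needs nothing beyond SQLite's built-in FTS5. A minimal stdlib sketch of the keyword route (table and column names are illustrative, not engram-lite's actual schema; sqlite-vec would supply the parallel vector-KNN route via a vec0 virtual table):

```python
import sqlite3

# Minimal BM25 keyword route using SQLite's built-in FTS5.
# Schema is illustrative only; engram-lite's real tables may differ.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(content)")
db.executemany(
    "INSERT INTO memories(content) VALUES (?)",
    [("The API runs on port 8432",),
     ("We use tabs, not spaces",),
     ("Chose SQLite over Postgres for zero-ops local storage",)],
)

def keyword_search(query: str, limit: int = 5) -> list[str]:
    # bm25() returns lower-is-better scores, so ascending order
    # puts the best match first
    rows = db.execute(
        "SELECT content FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [r[0] for r in rows]
```

Keeping both routes inside one SQLite file is what makes the store zero-ops: no separate vector service, just a library call.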

The Brilliant Revert

"We tried building a separate LLM to maintain MEMORY.md. Then we realized: the agent reading these instructions already is the LLM."

Early prototypes had a structured MEMORY.md format with dedicated refresh machinery — a separate pipeline to rewrite the file. It was scrapped. The agent simply composes its own memory narrative in prose, the same way you'd jot notes for a colleague.

Evidence: commit bcdd4c7 — "refactor: remove old structured MEMORY.md format and refresh-now machinery"

# What MEMORY.md looks like
# (authored by the agent, for the agent)

## Ken

Ken is a principal engineer at Microsoft who works on developer tooling.
He prefers composition over inheritance, uses early returns, and
considers `any` types a code smell. Currently building engram-lite.

## Active Context

Dual-route retrieval engine is the active workstream. System-1 targets
<50ms at 10k memories. Using sqlite-vec for KNN.

Dual-Route Retrieval

Adapted from Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory (arXiv:2602.15313, Tang et al., 2026). engram-lite adapts the core ideas for individual developer sessions with a local SQLite backend.

S1 System-1: Fast Path

Vector KNN via sqlite-vec + BM25 full-text via FTS5, fused with Reciprocal Rank Fusion.

Target: <50ms @ 10k memories

Best for: "What port does the API run on?"

S2 System-2: Deliberate Path

Top-down traversal of a hierarchical semantic graph. Collects structurally related memories that vector search alone would miss.

Target: <200ms @ 10k memories

Best for: "Summarize all security decisions we've made."

Auto-Routing

"kubernetes timeout"           → vector  (short, specific)
"HIPAA"                        → keyword (all-caps acronym)
"everything about auth"        → graph   (broad signal: "everything")
"how does rate limiting work?" → hybrid  (question-form)
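These heuristics can be sketched in a few lines of Python. This is a hypothetical router mirroring the examples above, not engram-lite's actual implementation; the real rules may be more nuanced:

```python
import re

def route(query: str) -> str:
    """Pick a retrieval route from surface features of the query.
    Hypothetical heuristics; engram-lite's real router may differ."""
    q = query.strip()
    if re.fullmatch(r"[A-Z0-9]{2,}", q):
        return "keyword"   # all-caps acronym, e.g. "HIPAA"
    if any(w in q.lower() for w in ("everything", "all ", "summarize")):
        return "graph"     # broad-signal queries want graph traversal
    if q.endswith("?") or q.lower().startswith(("how", "what", "why", "when")):
        return "hybrid"    # question-form: fuse both routes
    return "vector"        # short, specific phrases
```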

The Scoring Engine

Step 1: Reciprocal Rank Fusion

// Fuse KNN + BM25 ranked lists
// k = 60 (standard RRF constant)
RRF_score(d) = Σ 1 / (k + rank_i(d) + 1)

Scale-invariant: BM25 scores and cosine distances have different scales. RRF uses only rank positions.
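The fusion step is small enough to sketch directly. A minimal implementation of the formula above, assuming 0-indexed ranks (which is what the +1 in the denominator accounts for):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over several ranked id lists.
    Uses only rank positions, so BM25 scores and cosine
    distances never need to share a scale."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):  # rank is 0-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both lists outscores one that tops only a single list, which is exactly the behavior you want from a hybrid retriever.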

Step 2: Final Re-Ranking

// Four-signal weighted score
final_score = 0.40 × query_match   // cosine similarity
            + 0.25 × confidence    // 0.0 – 1.0
            + 0.20 × importance    // critical → 1.0
            + 0.15 × recency       // exp decay, 90d half-life

Relevance dominates at 40%. Recency is a tiebreaker, not the primary signal.
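The weighted score translates directly to code. A sketch of the re-ranking formula, assuming the recency signal is a plain exponential half-life decay (the spec states a 90-day half-life; the exact decay function is an assumption here):

```python
def final_score(query_match: float, confidence: float,
                importance: float, age_days: float,
                half_life_days: float = 90.0) -> float:
    """Four-signal weighted re-ranking score (weights per SPEC-RETRIEVAL).
    Recency decays exponentially: a 90-day-old memory contributes
    half the recency signal of a fresh one."""
    recency = 0.5 ** (age_days / half_life_days)
    return (0.40 * query_match
            + 0.25 * confidence
            + 0.20 * importance
            + 0.15 * recency)
```

With recency capped at a 0.15 weight, even a two-year-old architecture decision with a strong query match still outranks a fresh but irrelevant note.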

Latency Budget (targets at 10k memories)

  • System-1 (vector): <50ms
  • System-2 (graph): <200ms
  • Keyword (BM25): <30ms
  • Session pre-load: <100ms

Source: SPEC-RETRIEVAL §15 — Performance Targets. These are design targets, not benchmarks.

Privacy isn't a policy.
It's the filesystem topology.

USER SPACE (~/.engram/)
  • Your preferences
  • Personal workflow habits
  • Cross-project knowledge
  • People & relationships
  NEVER leaves your machine. NEVER leaks into project space.

PROJECT SPACE (<project>/.engram/)
  • Architecture decisions
  • Project conventions
  • Team patterns
  • Why-we-chose-X rationale
  Safe to commit to git. Shared via version control.

The README Test

Would this content be appropriate in a public README? If not, it's automatically routed to user space. PII, credentials, and private opinions are rejected from project space at the capture boundary.
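A capture-boundary filter for the README test might look like the following. This is a hypothetical sketch with illustrative patterns; engram-lite's actual classifier and its pattern set are not specified in this deck:

```python
import re

# Hypothetical patterns for the capture boundary; illustrative only.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|password|secret|token)\b"),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped PII
]

def route_space(content: str) -> str:
    """Apply the README test: content that would not belong in a
    public README is routed to user space, never project space."""
    if any(p.search(content) for p in SECRET_PATTERNS):
        return "user"
    return "project"
```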

Local Embeddings

fastembed uses ONNX Runtime — ~200MB install, no PyTorch, no GPU required. With Ollama, zero data leaves your machine. Default provider (OpenAI) sends only truncated content for embedding — no metadata, tags, or graph structure.

Zero-Friction Design

No pip install. No service to run. No accounts to create.

Claude Code — pick one:

# Option A: copy .mcp.json
cp /path/to/engram-lite/.mcp.json .

# Option B: register directly
claude mcp add --transport stdio engram-lite -- \
  uvx --from git+https://github.com/kenotron-ms/engram-lite \
  engram-lite-mcp

# That's it. Start Claude Code.
claude

Amplifier:

# Add to root bundle.md
includes:
  - bundle: git+https://github.com/kenotron-ms/engram-lite@main

3 Lifecycle Hooks

SessionStart

Inject MEMORY.md hot surface + behavioral protocol into context

UserPromptSubmit

Recall nudge — prompt the agent to check memory. Sync MEMORY.md and vector DB independently (A/B rule).

Stop

Capture reminder — evaluate what's worth remembering from this exchange

8 MCP Tools

capture recall search update forget relate graph_explore stats

RETRIEVE → RESPOND → CAPTURE

The three-phase loop runs silently on every conversational turn.

User sends prompt.

RETRIEVE: Hook injects recall reminder. Agent calls memory_recall / memory_search. Relevant memories are loaded into context.

RESPOND: Agent responds using conversation + memories. Memory operations are never mentioned to the user.

CAPTURE: Hook injects capture reminder. Agent evaluates what's worth remembering. New knowledge is stored with embeddings + graph links.

The A/B Independence Rule

MEMORY.md (Layer A) and the vector database (Layer B) are synced independently. Don't skip one because you did the other. Hot context and deep recall serve different purposes — both must stay current.

Memory is infrastructure,
not interface.

Forbidden phrases:

"I'm saving that to memory…"

"Let me check my memory…"

"I've captured that."

"Searching memory…"

The AI simply knows things.
Like a great colleague.

Before engram-lite

Session 1: "We use tabs, not spaces."
Session 2: "Like I said, tabs not spaces."
Session 3: "TABS. NOT SPACES."
Session 4: [gives up correcting]

After engram-lite

Session 1: "We use tabs, not spaces."
Session 2: [agent uses tabs without asking]
Session 7: [agent still uses tabs]
Session 30: [agent still uses tabs]

Sources & Methodology

Primary Sources

  • README.md — project overview, tool reference, quick start
  • pyproject.toml — package metadata, dependencies, version
  • docs/PRD.md — product requirements, problem statement
  • docs/ARCHITECTURE.md — system design, core principles
  • docs/SPEC-RETRIEVAL.md — dual-route engine, RRF, re-ranking, perf targets
  • docs/SPEC-PROTOCOLS.md — behavioral loop, silent operation
  • Git log (5 most recent commits on main)

Academic Reference

Tang et al. "Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory." arXiv:2602.15313, 2026.

Methodology

All numbers (RRF formula, re-ranking weights, latency targets) come directly from spec documents in the repository. Performance figures are design targets, not measured benchmarks. Commit references verified against git log.

Disclosure

Status: Alpha (Development Status :: 3 - Alpha). Author: Ken Chau, Microsoft (single primary author). Repo: kenotron-ms/engram-lite (personal GitHub org, not microsoft/).

Data as of: 2026-03-05. Deck generated from repository at commit 31f9272 (HEAD of main).

Two commands.
Memory is active.

# Clone or just grab .mcp.json
curl -sO https://raw.githubusercontent.com/\
kenotron-ms/engram-lite/main/.mcp.json

# Start Claude Code
claude

github.com/kenotron-ms/engram-lite

Alpha MIT License Python 3.11+ Works Today

"Instead of starting every conversation as a blank slate, the agent remembers your preferences, past decisions, project context, and working patterns — and applies them silently."
