Persistent, vector-backed memory for Claude Code and Amplifier agents.
Local SQLite. Dual-route retrieval. Silent by design.
What if your AI coding agent remembered you
the way a great colleague does?
Ken Chau · Microsoft · March 2026
"Human collaborators build shared understanding
over time. Without memory, AI agents are
perpetual strangers."
"We use tabs, not spaces." "The API runs on port 8432." "We decided against Redis because…" Stated once. Restated every session.
Decisions made three months ago — why you chose SQLite over Postgres, why the auth module uses that pattern — evaporate when the session closes.
A developer who has told their AI "I prefer composition over inheritance" twenty times eventually stops correcting it. The AI cannot learn.
Source: engram-lite PRD §2 — Problem Statement (2026-03-03)
Pipeline RAG is wrong for this problem. The hot stuff should always be there. The deep stuff should be retrieved on demand.
The Hot Surface
Prose narrative injected at every session start. Always present. No retrieval latency. The agent reads it like a colleague reading their notes before a meeting.
The Deep Store
SQLite + sqlite-vec + FTS5. Dual-route retrieval: fast vector KNN fused with BM25, plus hierarchical graph traversal for broad queries. Retrieved on demand via tool calls.
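A minimal sketch of the deep store's keyword route using only the standard-library `sqlite3` module's FTS5 support. Table and column names here are illustrative, not the repo's actual schema; the vector route would sit alongside this as a sqlite-vec `vec0` virtual table (omitted since it needs the external extension).

```python
import sqlite3

# Keyword (BM25) route of the deep store -- illustrative schema only.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(content)")
db.executemany(
    "INSERT INTO memories (content) VALUES (?)",
    [
        ("The API runs on port 8432.",),
        ("We decided against Redis because of operational overhead.",),
        ("Prefer composition over inheritance.",),
    ],
)

# bm25() returns lower scores for better matches, so ascending
# order ranks the most relevant memory first.
rows = db.execute(
    "SELECT content FROM memories WHERE memories MATCH ? ORDER BY bm25(memories)",
    ("port",),
).fetchall()
print(rows[0][0])
```

In production the same query shape backs the <30ms keyword budget; BM25 results are later fused with vector KNN ranks rather than compared by raw score.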
"We tried building a separate LLM to maintain MEMORY.md. Then we realized: the agent reading these instructions already is the LLM."
Early prototypes had a structured MEMORY.md format with dedicated refresh machinery — a separate pipeline to rewrite the file. It was scrapped. The agent simply composes its own memory narrative in prose, the same way you'd jot notes for a colleague.
Evidence: commit bcdd4c7 —
"refactor: remove old structured MEMORY.md format and refresh-now machinery"
Adapted from Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory (arXiv:2602.15313, Tang et al., 2026). engram-lite adapts the core ideas for individual developer sessions with a local SQLite backend.
Vector KNN via sqlite-vec + BM25 full-text via FTS5, fused with Reciprocal Rank Fusion.
Target: <50ms @ 10k memories
Best for: "What port does the API run on?"
Top-down traversal of a hierarchical semantic graph. Collects structurally related memories that vector search alone would miss.
Target: <200ms @ 10k memories
Best for: "Summarize all security decisions we've made."
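The System-2 route can be pictured as a breadth-first, top-down walk over the semantic graph. The node names and children map below are hypothetical stand-ins for the repo's stored hierarchy; the point is that a broad query like "all security decisions" collects structurally related memories regardless of embedding distance.

```python
from collections import deque

# Hypothetical in-memory hierarchy -- illustrative, not the repo's schema.
children = {
    "root": ["security", "infrastructure"],
    "security": ["auth-pattern", "secrets-policy"],
    "infrastructure": ["db-choice"],
}
memories = {
    "auth-pattern": "Auth module uses token introspection.",
    "secrets-policy": "Credentials never land in project space.",
    "db-choice": "Chose SQLite over Postgres for zero-ops local storage.",
}

def traverse(topic: str) -> list[str]:
    """Collect every memory under a topic node, breadth-first."""
    found, queue = [], deque([topic])
    while queue:
        node = queue.popleft()
        if node in memories:
            found.append(memories[node])
        queue.extend(children.get(node, []))
    return found

print(traverse("security"))
```

A vector KNN on "security decisions" might miss `db-choice`-style leaves whose text never mentions security; traversal from the right topic node does not.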
Auto-Routing
Step 1: Reciprocal Rank Fusion
Scale-invariant: BM25 scores and cosine distances have different scales. RRF uses only rank positions.
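The fusion step can be sketched in a few lines. RRF scores each document as the sum of `1 / (k + rank)` across result lists, so only rank positions matter. `k = 60` is the conventional constant from the RRF literature; the repo's actual value is specified in SPEC-RETRIEVAL.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    Rank-only, so BM25 scores and cosine distances need no common scale."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["m7", "m2", "m9"]   # KNN order, closest first
keyword_hits = ["m2", "m4", "m7"]  # BM25 order, best match first
print(rrf_fuse([vector_hits, keyword_hits]))
```

Note that `m2` wins despite topping only one list: appearing high in both routes beats winning one outright, which is exactly the behavior fusion is for.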
Step 2: Final Re-Ranking
Relevance dominates at 40%. Recency is a tiebreaker, not the primary signal.
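As a weighted-sum sketch: the 0.40 relevance weight is from this deck, but the remaining signals and their weights below are placeholders; the authoritative split lives in SPEC-RETRIEVAL.

```python
# Relevance weight (0.40) is from the spec; other signals/weights
# are illustrative assumptions for this sketch.
WEIGHTS = {"relevance": 0.40, "recency": 0.20, "importance": 0.20, "usage": 0.20}

def final_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each pre-normalized to [0, 1]."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

fresh_but_weak = {"relevance": 0.3, "recency": 1.0, "importance": 0.5, "usage": 0.5}
stale_but_strong = {"relevance": 0.9, "recency": 0.1, "importance": 0.5, "usage": 0.5}
# Relevance dominates: the older but more relevant memory still wins.
assert final_score(stale_but_strong) > final_score(fresh_but_weak)
```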
Latency Budget (targets at 10k memories)
System-1 (vector): <50ms
System-2 (graph): <200ms
Keyword (BM25): <30ms
Session pre-load: <100ms
Source: SPEC-RETRIEVAL §15 — Performance Targets. These are design targets, not benchmarks.
Would this content be appropriate in a public README? If not, it's automatically routed to user space. PII, credentials, and private opinions are rejected from project space at the capture boundary.
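A hypothetical sketch of that capture-boundary check. The real gate is the agent's "public README" judgment; pattern matching like the below would only be a backstop, and these specific regexes are assumptions, not the project's actual rules.

```python
import re

# Illustrative backstop patterns -- not the repo's actual heuristics.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|password|token|secret)\b\s*[:=]"),
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email-shaped PII
]

def route(content: str) -> str:
    """Route private-looking content to user space, the rest to project space."""
    if any(p.search(content) for p in SECRET_PATTERNS):
        return "user"
    return "project"

print(route("API_KEY = sk-abc123"))        # credential -> user space
print(route("The API runs on port 8432.")) # README-safe -> project space
```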
fastembed uses ONNX Runtime — ~200MB install, no PyTorch, no GPU required. With Ollama, zero data leaves your machine. Default provider (OpenAI) sends only truncated content for embedding — no metadata, tags, or graph structure.
No pip install. No service to run. No accounts to create.
Claude Code — pick one:
Amplifier:
3 Lifecycle Hooks
Inject MEMORY.md hot surface + behavioral protocol into context
Recall nudge — prompt the agent to check memory. Sync MEMORY.md and vector DB independently (A/B rule).
Capture reminder — evaluate what's worth remembering from this exchange
8 MCP Tools
The three-phase loop runs silently on every conversational turn.
MEMORY.md (Layer A) and the vector database (Layer B) are synced independently. Don't skip one because you did the other. Hot context and deep recall serve different purposes — both must stay current.
Forbidden phrases:
"I'm saving that to memory…"
"Let me check my memory…"
"I've captured that."
"Searching memory…"
The AI simply knows things.
Like a great colleague.
README.md — project overview, tool reference, quick start
pyproject.toml — package metadata, dependencies, version
docs/PRD.md — product requirements, problem statement
docs/ARCHITECTURE.md — system design, core principles
docs/SPEC-RETRIEVAL.md — dual-route engine, RRF, re-ranking, perf targets
docs/SPEC-PROTOCOLS.md — behavioral loop, silent operation
Tang et al. "Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory." arXiv:2602.15313, 2026.
All numbers (RRF formula, re-ranking weights, latency targets) come directly from spec documents in the repository. Performance figures are design targets, not measured benchmarks.
Commit references verified against git log.
Status: Alpha (Development Status :: 3 - Alpha).
Author: Ken Chau, Microsoft (single primary author).
Repo: kenotron-ms/engram-lite (personal GitHub org, not microsoft/).
Data as of: 2026-03-05. Deck generated from repository at commit 31f9272 (HEAD of main).
github.com/kenotron-ms/engram-lite
"Instead of starting every conversation as a blank slate, the agent remembers your preferences, past decisions, project context, and working patterns — and applies them silently."