Persistent, vector-backed memory for Claude Code and Amplifier agents.
Local SQLite. Dual-route retrieval. Silent by design.
What if your AI coding agent remembered you
the way a great colleague does?
Ken Chau · Microsoft · March 2026
"Human collaborators build shared understanding
over time. Without memory, AI agents are
perpetual strangers."
"We use tabs, not spaces." "The API runs on port 8432." "We decided against Redis because…" Stated once. Restated every session.
Decisions made three months ago — why you chose SQLite over Postgres, why the auth module uses that pattern — evaporate when the session closes.
A developer who has told their AI "I prefer composition over inheritance" twenty times eventually stops correcting it. The AI cannot learn.
Source: engram-lite PRD §2 — Problem Statement (2026-03-03)
Pipeline RAG is wrong for this problem. The hot stuff should always be there. The deep stuff should be retrieved on demand.
The Hot Surface
Prose narrative injected at every session start. Always present. No retrieval latency. The agent reads it like a colleague reading their notes before a meeting.
The Deep Store
SQLite + sqlite-vec + FTS5. Dual-route retrieval: fast vector KNN fused with BM25, plus hierarchical graph traversal for broad queries. Retrieved on demand via tool calls.
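A minimal sketch of the deep store's keyword route using only the standard-library `sqlite3` module's FTS5 support. Table and column names here are illustrative, not the repo's actual schema; the vector route would sit alongside this as a sqlite-vec `vec0` virtual table (omitted since it needs the external extension).

```python
import sqlite3

# Keyword (BM25) route of the deep store -- illustrative schema only.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(content)")
db.executemany(
    "INSERT INTO memories (content) VALUES (?)",
    [
        ("The API runs on port 8432.",),
        ("We decided against Redis because of operational overhead.",),
        ("Prefer composition over inheritance.",),
    ],
)

# bm25() returns lower scores for better matches, so ascending
# order ranks the most relevant memory first.
rows = db.execute(
    "SELECT content FROM memories WHERE memories MATCH ? ORDER BY bm25(memories)",
    ("port",),
).fetchall()
print(rows[0][0])
```

In production the same query shape backs the <30ms keyword budget; BM25 results are later fused with vector KNN ranks rather than compared by raw score.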
"We tried building a separate LLM to maintain MEMORY.md. Then we realized: the agent reading these instructions already is the LLM."
Early prototypes had a structured MEMORY.md format with dedicated refresh machinery — a separate pipeline to rewrite the file. It was scrapped. The agent simply composes its own memory narrative in prose, the same way you'd jot notes for a colleague.
Evidence: commit bcdd4c7 —
"refactor: remove old structured MEMORY.md format and refresh-now machinery"
Adapted from Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory (arXiv:2602.15313, Tang et al., 2026). engram-lite adapts the core ideas for individual developer sessions with a local SQLite backend.
Vector KNN via sqlite-vec + BM25 full-text via FTS5, fused with Reciprocal Rank Fusion.
Target: <50ms @ 10k memories
Best for: "What port does the API run on?"
Top-down traversal of a hierarchical semantic graph. Collects structurally related memories that vector search alone would miss.
Target: <200ms @ 10k memories
Best for: "Summarize all security decisions we've made."
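The System-2 route can be pictured as a breadth-first, top-down walk over the semantic graph. The node names and children map below are hypothetical stand-ins for the repo's stored hierarchy; the point is that a broad query like "all security decisions" collects structurally related memories regardless of embedding distance.

```python
from collections import deque

# Hypothetical in-memory hierarchy -- illustrative, not the repo's schema.
children = {
    "root": ["security", "infrastructure"],
    "security": ["auth-pattern", "secrets-policy"],
    "infrastructure": ["db-choice"],
}
memories = {
    "auth-pattern": "Auth module uses token introspection.",
    "secrets-policy": "Credentials never land in project space.",
    "db-choice": "Chose SQLite over Postgres for zero-ops local storage.",
}

def traverse(topic: str) -> list[str]:
    """Collect every memory under a topic node, breadth-first."""
    found, queue = [], deque([topic])
    while queue:
        node = queue.popleft()
        if node in memories:
            found.append(memories[node])
        queue.extend(children.get(node, []))
    return found

print(traverse("security"))
```

A vector KNN on "security decisions" might miss `db-choice`-style leaves whose text never mentions security; traversal from the right topic node does not.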
Auto-Routing
Step 1: Reciprocal Rank Fusion
Scale-invariant: BM25 scores and cosine distances have different scales. RRF uses only rank positions.
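The fusion step can be sketched in a few lines. RRF scores each document as the sum of `1 / (k + rank)` across result lists, so only rank positions matter. `k = 60` is the conventional constant from the RRF literature; the repo's actual value is specified in SPEC-RETRIEVAL.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    Rank-only, so BM25 scores and cosine distances need no common scale."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["m7", "m2", "m9"]   # KNN order, closest first
keyword_hits = ["m2", "m4", "m7"]  # BM25 order, best match first
print(rrf_fuse([vector_hits, keyword_hits]))
```

Note that `m2` wins despite topping only one list: appearing high in both routes beats winning one outright, which is exactly the behavior fusion is for.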
Step 2: Final Re-Ranking
Relevance dominates at 40%. Recency is a tiebreaker, not the primary signal.
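As a weighted-sum sketch: the 0.40 relevance weight is from this deck, but the remaining signals and their weights below are placeholders; the authoritative split lives in SPEC-RETRIEVAL.

```python
# Relevance weight (0.40) is from the spec; other signals/weights
# are illustrative assumptions for this sketch.
WEIGHTS = {"relevance": 0.40, "recency": 0.20, "importance": 0.20, "usage": 0.20}

def final_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each pre-normalized to [0, 1]."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

fresh_but_weak = {"relevance": 0.3, "recency": 1.0, "importance": 0.5, "usage": 0.5}
stale_but_strong = {"relevance": 0.9, "recency": 0.1, "importance": 0.5, "usage": 0.5}
# Relevance dominates: the older but more relevant memory still wins.
assert final_score(stale_but_strong) > final_score(fresh_but_weak)
```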
Latency Budget (targets at 10k memories)
System-1 (vector): <50ms
System-2 (graph): <200ms
Keyword (BM25): <30ms
Session pre-load: <100ms
Source: SPEC-RETRIEVAL §15 — Performance Targets. These are design targets, not benchmarks.
Would this content be appropriate in a public README? If not, it's automatically routed to user space. PII, credentials, and private opinions are rejected from project space at the capture boundary.
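A hypothetical sketch of that capture-boundary check. The real gate is the agent's "public README" judgment; pattern matching like the below would only be a backstop, and these specific regexes are assumptions, not the project's actual rules.

```python
import re

# Illustrative backstop patterns -- not the repo's actual heuristics.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|password|token|secret)\b\s*[:=]"),
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email-shaped PII
]

def route(content: str) -> str:
    """Route private-looking content to user space, the rest to project space."""
    if any(p.search(content) for p in SECRET_PATTERNS):
        return "user"
    return "project"

print(route("API_KEY = sk-abc123"))        # credential -> user space
print(route("The API runs on port 8432.")) # README-safe -> project space
```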
fastembed uses ONNX Runtime — ~200MB install, no PyTorch, no GPU required. With Ollama, zero data leaves your machine. Default provider (OpenAI) sends only truncated content for embedding — no metadata, tags, or graph structure.
No pip install. No service to run. No accounts to create.
Claude Code — pick one:
Amplifier:
3 Lifecycle Hooks
Inject MEMORY.md hot surface + behavioral protocol into context
Recall nudge — prompt the agent to check memory. Sync MEMORY.md and vector DB independently (A/B rule).
Capture reminder — evaluate what's worth remembering from this exchange
8 MCP Tools
The three-phase loop runs silently on every conversational turn.
MEMORY.md (Layer A) and the vector database (Layer B) are synced independently. Don't skip one because you did the other. Hot context and deep recall serve different purposes — both must stay current.
Forbidden phrases:
"I'm saving that to memory…"
"Let me check my memory…"
"I've captured that."
"Searching memory…"
The AI simply knows things.
Like a great colleague.
README.md — project overview, tool reference, quick start
pyproject.toml — package metadata, dependencies, version
docs/PRD.md — product requirements, problem statement
docs/ARCHITECTURE.md — system design, core principles
docs/SPEC-RETRIEVAL.md — dual-route engine, RRF, re-ranking, perf targets
docs/SPEC-PROTOCOLS.md — behavioral loop, silent operation
Tang et al. "Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory." arXiv:2602.15313, 2026.
All numbers (RRF formula, re-ranking weights, latency targets) come directly from spec documents in the repository. Performance figures are design targets, not measured benchmarks.
Commit references verified against git log.
Status: Alpha (Development Status :: 3 - Alpha).
Author: Ken Chau, Microsoft (single primary author).
Repo: kenotron-ms/engram-lite (personal GitHub org, not microsoft/).
Data as of: 2026-03-05. Deck generated from repository at commit 31f9272 (HEAD of main).
github.com/kenotron-ms/engram-lite
"Instead of starting every conversation as a blank slate, the agent remembers your preferences, past decisions, project context, and working patterns — and applies them silently."