MemPalace: The 96.6% Recall AI Memory System That Is Mostly ChromaDB With a Good Philosophy

The headline benchmark is technically real and substantially misleading. Both facts are worth understanding. The core contribution, verbatim storage with zero LLM calls at write time and a spatial hierarchy for scoped retrieval, is a defensible design choice that beats extraction-based competitors on recall. The palace hierarchy itself hurts performance. Knowing which part actually works is the analysis most coverage skips.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 13, 2026

The standard AI memory system makes a bet at write time: it runs an LLM over your conversation, extracts what it decides is important, and stores that summary. The bet is that the LLM's extraction will capture everything you later need to retrieve. When it is right, you get a compact, fast, queryable memory. When it is wrong, the information is gone permanently. The extraction decision is irreversible.

MemPalace rejects this bet. It stores everything verbatim. Nothing is extracted, summarized, or paraphrased. Retrieval happens semantically over the raw text. If you need it later, it is there. If the embedding model finds it, you get it. No extraction failures, no irreversible information loss.

This is a contrarian position in a field that has converged on extraction. Mem0, Mastra, Zep, and similar systems all make write-time extraction decisions. MemPalace does not. The question is whether verbatim storage's recall advantages outweigh its storage and context costs. On LongMemEval, the answer is clearly yes at 96.6% Recall@5. The qualification: that 96.6% is ChromaDB's default embedding model performing nearest-neighbor search on uncompressed text. The palace hierarchy (Wings to Rooms to Drawers) is metadata filtering on top. The independent analysis from Issue #29 and the arXiv critical paper (arXiv:2604.21284) confirmed it: the palace architecture itself regresses retrieval performance when tested in isolation.

Scope: MemPalace's verbatim storage design and genuine contributions, the five-level spatial hierarchy and what it actually does technically, the benchmark facts accurately stated, AAAK compression and its real costs, MemMachine (arXiv:2604.04853) as the architecturally adjacent comparison. Not covered: Mem0, Zep, or Mastra beyond brief benchmark context.

What It Actually Does

MemPalace (MIT, created by Milla Jovovich and Ben Sigman using Claude Code) is a local-first AI memory layer with two runtime dependencies: ChromaDB and PyYAML. It runs entirely locally. No API keys. No cloud. No subscription.

The write path is fully deterministic and LLM-free:

Raw conversation text chunked at 800 characters with 100-character overlap
Rule-based classification (regex + keyword matching) assigns chunks to hierarchy levels
Chunks stored verbatim in a single ChromaDB collection (mempalace_drawers)
Temporal entity-relationship triples stored in local SQLite
No LLM call at any point during write
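
The rule-based classification step can be pictured with a minimal sketch. The keyword patterns and the `classify_hall` helper below are hypothetical, invented for illustration; they are not MemPalace's actual rules. The sketch only shows how regex and keyword matching can assign a hall without any LLM call.

```python
import re

# Hypothetical keyword patterns, one per hall; NOT MemPalace's real rules.
HALL_PATTERNS = {
    "events": re.compile(r"\b(deadline|demo|meeting|yesterday|tomorrow|monday)\b", re.I),
    "advice": re.compile(r"\b(should|recommend|suggest|avoid)\b", re.I),
    "emotion": re.compile(r"\b(frustrated|worried|excited|happy|stressed)\b", re.I),
}


def classify_hall(chunk: str, default: str = "facts") -> str:
    """Return the hall whose pattern matches most often, else the default."""
    counts = {hall: len(p.findall(chunk)) for hall, p in HALL_PATTERNS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default
```

Because the function is pure string matching, the same chunk always lands in the same hall, which is the determinism the write path relies on. The cost is that a chunk whose wording falls outside the keyword lists falls through to the default hall.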

The hierarchy (five levels):

Level	What it holds	Technical implementation
Wings	Top-level domains (a person, project, topic)	ChromaDB metadata field
Halls	Memory types (facts, events, advice, emotional context)	ChromaDB metadata field
Rooms	Specific subjects within a wing (auth, billing, deployment)	ChromaDB metadata field
Closets	Compressed summaries (AAAK-compressed text)	Separate ChromaDB entries
Drawers	Verbatim originals (the actual data)	Main ChromaDB entries

Tunnels: when the same room name (e.g., "auth-migration") appears under multiple wings, the system creates a cross-wing link in SQLite, enabling cross-entity retrieval.
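
Tunnel detection can be sketched as a single SQL query, assuming a table with one row per (wing, room) pair. The `rooms` schema and the `find_tunnels` helper are assumptions made for this example; MemPalace's actual SQLite layout is not shown here.

```python
import sqlite3

# Hypothetical schema: one row per (wing, room) pair seen at write time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (wing TEXT, room TEXT, UNIQUE (wing, room))")
conn.executemany(
    "INSERT OR IGNORE INTO rooms VALUES (?, ?)",
    [
        ("alice", "auth-migration"),
        ("project-x", "auth-migration"),
        ("alice", "communication-preferences"),
    ],
)


def find_tunnels(db: sqlite3.Connection) -> dict[str, list[str]]:
    """Map each room name that appears under 2+ wings to its sorted wing list."""
    rows = db.execute(
        """
        SELECT room, wing FROM rooms
        WHERE room IN (
            SELECT room FROM rooms GROUP BY room HAVING COUNT(DISTINCT wing) > 1
        )
        ORDER BY room, wing
        """
    ).fetchall()
    tunnels: dict[str, list[str]] = {}
    for room, wing in rows:
        tunnels.setdefault(room, []).append(wing)
    return tunnels


tunnels = find_tunnels(conn)
```

With the sample rows, only "auth-migration" qualifies as a tunnel because it is the only room name shared by two wings. Note that the link is purely name-based: two wings that use different room names for the same subject would not be connected under this scheme.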

Installation and wake-up:

pip install mem-palace
# or:
git clone https://github.com/MemPalace/mempalace.git
cd mempalace && uv sync --extra dev

# Wake-up cost: ~170 tokens (L0 system prompt + L1 palace index)
# This is the real differentiator vs. extraction-based systems:
# no boot-time LLM call needed to reconstruct the memory state

The Architecture, Unpacked

Focus on the retrieval path. The palace hierarchy performs exactly one function at retrieval time: metadata filtering in ChromaDB. This is a standard and effective technique available in any vector database. The performance advantage comes from verbatim storage combined with all-MiniLM-L6-v2 on full text, not from the Wings/Rooms/Drawers organization.

The Code, Annotated

Snippet One: Write Path (Zero LLM, Verbatim Storage)

# MemPalace write path: deterministic chunking and verbatim storage
# Source: MemPalace/mempalace (MIT) — reconstructed from README + issue analysis
# The design: no LLM at write time = zero API cost, zero extraction risk

import chromadb
from chromadb.utils import embedding_functions

# ── SETUP: single ChromaDB collection for all drawers ─────────────────────────
client = chromadb.PersistentClient(path="./mempalace_data")

# ← all-MiniLM-L6-v2 is the default embedding model
# This is the model whose performance drives the 96.6% benchmark score
# It is pluggable: swap via mempalace/backends/base.py interface
# ← The headline number is this model's performance on verbatim text retrieval
#   It is NOT a new embedding model or a novel retrieval approach
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "mempalace_drawers",
    embedding_function=embed_fn,
)


def write_to_palace(
    text: str,
    wing: str,    # e.g., "alice" (person) or "project-x" (project)
    room: str,    # e.g., "auth" or "billing"
    hall: str = "facts",   # memory type: facts/events/advice/emotion
    chunk_size: int = 800,
    overlap: int = 100,
) -> list[str]:
    """
    Store text verbatim in the palace. Zero LLM calls at any point.
    
    ← WHY verbatim? Extraction-based systems (Mem0, Zep) run an LLM to decide
      what to store. If the LLM misses something, it's gone permanently.
      MemPalace stores everything and lets the embedding model surface it later.
      The tradeoff: more storage, more retrieval context, better recall.
    
    ← The chunking is the only transformation applied.
      800 chars is large enough to preserve semantic context;
      the 100-char overlap repeats text at each boundary, so a sentence
      cut by one chunk usually appears whole in the neighboring chunk.
    """
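    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    if not chunks:
        return []   # nothing to store (empty or whitespace-only input)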

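    # ← THIS is the trick: Wing, Hall, Room stored as ChromaDB metadata
    # At retrieval time, these become filter parameters
    # This is standard vector DB metadata filtering, not a novel palace mechanism
    # IDs include a hash of the input text. A bare f"{wing}_{room}_{i}" repeats
    # across calls, and ChromaDB will not store a second record under an existing ID.
    import hashlib  # local import keeps this snippet self-contained
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    ids = [f"{wing}_{room}_{digest}_{i}" for i in range(len(chunks))]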
    collection.add(
        documents=chunks,         # verbatim text (no summary, no extraction)
        ids=ids,
        metadatas=[{
            "wing": wing,
            "hall": hall,
            "room": room,
            "type": "drawer",     # raw verbatim entry
        } for _ in chunks]
    )
    return ids


# ── AAAK COMPRESSION (optional, degrades performance): ───────────────────────
# AAAK = regex entity codes + keyword frequency + 55-char sentence truncation
# This is rule-based only (no LLM)
# ← Despite "30x lossless compression" marketing claim, it is LOSSY
#   LongMemEval drops from 96.6% to 84.2% in AAAK mode (-12.4pp)
#   The decode() function is string splitting — no original text reconstruction
#   Do NOT use AAAK if recall accuracy matters

def aaak_compress_LOSSY(text: str) -> str:
    """
    Rule-based compression: regex entity codes + keyword frequency.
    ← NOT lossless. The original text cannot be fully reconstructed.
    ← Benchmark shows -12.4 percentage point retrieval quality loss.
    ← Only use if storage constraints outweigh recall accuracy requirements.
    """
    # Truncate each sentence to 55 chars
    # (entity codes and keyword frequency are omitted in this sketch)
    sentences = text.split('. ')
    compressed = ' | '.join(s[:55] for s in sentences if s.strip())
    return compressed   # lossy abbreviation, not compression

# Usage: store verbatim in drawers (default), AAAK in closets (optional)
write_to_palace(
    text="Alice mentioned she prefers async communication over Slack meetings.",
    wing="alice",
    room="communication-preferences",
    hall="facts",
)
# ← Chunk stored verbatim in ChromaDB
# ← No LLM call made
# ← Wake-up cost: ~170 tokens total for palace index retrieval

The collection.add() call with wing/hall/room as metadata is the entire "palace structure" at the code level. It is ChromaDB metadata tagging. Effective, low-latency, and standard. The spatial metaphor is a useful mental model for organizing those tags. It is not a novel technical mechanism.

Snippet Two: Retrieval and Benchmark Reality Check

# MemPalace retrieval: what the benchmark actually measures
# Source: MemPalace/mempalace (MIT) + lhl/agentic-memory analysis
# Understanding what 96.6% means and when you get it

from typing import Optional

def retrieve_from_palace(
    query: str,
    n_results: int = 5,
    wing: Optional[str] = None,    # scope to specific person/project
    room: Optional[str] = None,    # scope to specific topic
) -> list[dict]:
    """
    Retrieve verbatim chunks using semantic search.
    
    ← The 96.6% LongMemEval R@5 score is measured in 'raw mode':
      - No wing/room filtering applied
      - Query runs against ALL drawers in the single ChromaDB collection
      - This is nearest-neighbor search with all-MiniLM-L6-v2
      - The palace structure (wing/room filtering) is NOT used in this mode
      
    ← Independent M2 Ultra replication confirmed:
      Adding wing/room metadata filtering (using the palace) REDUCES recall
      The scoping helps precision but hurts recall on benchmark questions
      that require cross-entity retrieval
    """
    where_filter = {}
    if wing and room:
        # ← Scoped retrieval: faster, but loses cross-wing signals
        where_filter = {"$and": [{"wing": wing}, {"room": room}]}
    elif wing:
        where_filter = {"wing": wing}
    elif room:
        # Room-only scope: matches this room under every wing
        where_filter = {"room": room}
    # ← Note: no filter = raw mode = benchmark performance

    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_filter if where_filter else None,
    )
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


# ── BENCHMARK MODES EXPLAINED ──────────────────────────────────────────────────
# RAW MODE: 96.6% LongMemEval R@5
#   - retrieve_from_palace(query, n_results=5, wing=None, room=None)
#   - This is standard ChromaDB nearest-neighbor search on verbatim text
#   - No API calls required, runs fully locally

# HYBRID MODE: 100% LongMemEval R@5
#   - retrieve_from_palace(query, n_results=20) → rerank with Haiku API
#   - ← Requires Haiku API call at query time (not zero-API)
#   - The "100%" requires cloud model access

# PALACE MODE: below 96.6% (regression)
#   - retrieve_from_palace(query, wing="alice", room="preferences")
#   - ← Useful for precision-focused queries, hurts recall on broad queries
#   - The benchmark was NOT run in this mode

# AAAK MODE: 84.2% LongMemEval R@5
#   - Same retrieval, but documents were stored in AAAK-compressed form
#   - 12.4pp recall loss from compression artifacts
#   - ← Do not use AAAK if you care about recall accuracy


# ── MEMMACHINE COMPARISON ─────────────────────────────────────────────────────
# MemMachine (arXiv:2604.04853, April 2026) takes the same verbatim philosophy
# but adds:
# - Short-term + long-term episodic + profile memory layers
# - Contextualized retrieval: expand nucleus matches with surrounding context
#   ← Improves recall when evidence spans multiple dialogue turns
# - LongMemEvalS: 93.0% accuracy vs ~96.6% MemPalace recall@5
#   (different metrics: accuracy vs recall)
# - LoCoMo benchmark: 0.9169 with gpt-4.1-mini
# - ~80% fewer input tokens than Mem0 under matched conditions
# ← Both MemPalace and MemMachine converge on the same core insight:
#   verbatim storage + retrieval-time selection beats extraction-based systems

The key comparison: where_filter = {} (raw mode, benchmark performance) versus where_filter = {"wing": wing} (palace mode, below benchmark). The spatial metaphor is the organizing principle. The performance comes from the embedding model on verbatim text. Both are true simultaneously.
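
The effect of scoping can be seen without ChromaDB in a minimal sketch. The 2-D vectors, drawer IDs, and the `search` helper below are toy assumptions for illustration, not MemPalace code; the point is only that a metadata filter removes candidates before ranking, so a closer match in another wing can never be returned.

```python
import math

# Toy 2-D "embeddings"; real systems use a learned model such as all-MiniLM-L6-v2.
DRAWERS = [
    {"id": "a1", "wing": "alice", "vec": (0.9, 0.1)},
    {"id": "a2", "wing": "alice", "vec": (0.2, 0.8)},
    {"id": "p1", "wing": "project-x", "vec": (0.95, 0.05)},
]


def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def search(query_vec, k, wing=None):
    """Filter by wing first (if given), then rank the survivors by distance."""
    pool = [d for d in DRAWERS if wing is None or d["wing"] == wing]
    ranked = sorted(pool, key=lambda d: cosine_distance(query_vec, d["vec"]))
    return [d["id"] for d in ranked[:k]]


query = (1.0, 0.0)
unscoped = search(query, k=2)              # ranks all three drawers
scoped = search(query, k=2, wing="alice")  # p1 is removed before ranking
```

In the unscoped call the closest drawer is `p1` from the project-x wing. The scoped call cannot return it and fills the second slot with the much weaker `a2` instead, which is the recall loss described above in miniature.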

MemPalace in Action: End-to-End Worked Example

Scenario: Claude Code session across three days, using MemPalace for cross-session memory

Day 1: Writing memories

Conversation chunk: "Alice needs the auth migration to work with the new OAuth 
provider before the Monday demo. She's blocked on the PKCE flow implementation."

write_to_palace(
    text="Alice needs the auth migration to work with the new OAuth provider...",
    wing="alice",
    room="auth-migration",
    hall="events",
)
# Stored verbatim in ChromaDB
# Metadata: wing="alice", room="auth-migration", hall="events", type="drawer"
# No LLM call made. Cost: $0.

write_to_palace(
    text="Project-X OAuth migration needs PKCE flow. Monday demo deadline.",
    wing="project-x",
    room="auth-migration",
    hall="facts",
)
# Tunnel created: "auth-migration" appears in both wings
# SQLite entry: (alice) -[auth-migration]→ (project-x)

Day 3: Retrieval (new Claude Code session)

Query: "What was Alice working on for the Monday demo?"

retrieve_from_palace(
    query="What was Alice working on for the Monday demo?",
    n_results=5,
    # ← No wing/room filter: raw mode, benchmark performance
)

Wake-up cost:
  L0: system prompt with palace overview (~120 tokens)
  L1: palace index (wing list + room summaries) (~50 tokens)
  Total: ~170 tokens
  ← Vs extraction-based systems: often require a full memory reconstruction
    LLM call at session start (500-2000 tokens)

Retrieved chunks (R@5):
  Rank 1: "Alice needs the auth migration to work with the new OAuth provider 
           before the Monday demo..." [distance: 0.12]
  Rank 2: "Project-X OAuth migration needs PKCE flow. Monday demo deadline." 
           [distance: 0.18]
  Rank 3: [earlier alice conversation chunk, distance: 0.31]
  Rank 4: [earlier project-x chunk, distance: 0.35]
  Rank 5: [less relevant chunk, distance: 0.48]

Retrieval time: ~35ms (local ChromaDB)
API calls: zero

MemMachine on same scenario:

MemMachine contextualized retrieval:
  Nucleus match: Rank 1 chunk (same as above)
  Context expansion: retrieves N surrounding chunks from same episode
  → "The PKCE flow was implemented on day 2: Alice confirmed it works
     with the test OAuth provider but not production."
  ← This surrounding-context retrieval recovers information that
    MemPalace's isolated chunk retrieval might miss
  LongMemEvalS accuracy: 93.0% (accuracy metric, vs MemPalace's 96.6% recall)
  Token cost: ~80% fewer input tokens than Mem0 under matched conditions
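
The nucleus-plus-context idea can be sketched in a few lines. This is not MemMachine's API: the `expand_with_context` helper, the `window` parameter, and the three-chunk episode are assumptions made for this example, which only illustrates returning a matched chunk together with its neighbors from the same episode.

```python
def expand_with_context(episode: list[str], hit_index: int, window: int = 1) -> list[str]:
    """Return the matched chunk plus up to `window` neighbors on each side."""
    if not 0 <= hit_index < len(episode):
        raise IndexError("hit_index is outside the episode")
    start = max(0, hit_index - window)
    end = min(len(episode), hit_index + window + 1)
    return episode[start:end]


# Hypothetical episode: chunks in their original conversation order.
episode = [
    "Alice needs the auth migration to work before the Monday demo.",
    "She's blocked on the PKCE flow implementation.",
    "The PKCE flow works with the test OAuth provider but not production.",
]
expanded = expand_with_context(episode, hit_index=0, window=2)
```

With `window=2` the match on the first chunk also pulls in the later status update, even though that chunk might not rank in the top results on its own. The tradeoff is a larger retrieved context per hit, so the window size is a budget decision.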

Why This Design Works, and What It Trades Away

The verbatim storage philosophy is correct for recall-critical applications. Extraction-based systems make irreversible decisions at write time. MemPalace defers all decisions to retrieval time, where better tools (rerankers, hybrid search, larger LLMs) are available to make them. The 96.6% Recall@5 versus Mem0's pre-update ~49% is a large real gap on a real benchmark, even after properly attributing the score to ChromaDB's embedding model rather than to the palace hierarchy.

The zero-LLM write path has compounding advantages beyond cost: determinism, offline operation, and no write-time latency spikes from LLM calls. A memory system that writes synchronously in a Claude Code session cannot afford 1-2 second LLM inference delays on every conversation turn. MemPalace's 800-char chunking and rule-based classification are effectively instantaneous.

The ~170-token wake-up cost is genuinely differentiated. Extraction-based systems that store summaries need to pass those summaries into context at session start, which scales with memory volume. MemPalace's L0+L1 bootstrap is fixed regardless of how many drawers exist.

What MemPalace trades away:

Storage and context costs grow without bound. Every conversation chunk is stored verbatim, forever. The retrieved context is raw text, not a compressed summary. As the memory grows to thousands of sessions, retrieval context density increases. MemPalace does not currently address memory decay, importance weighting, or context budget management for very long-lived agents.
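
A back-of-the-envelope estimate makes the growth concrete. The sketch below assumes the 800/100 chunking described earlier, 384-dimensional float32 vectors (the output size of all-MiniLM-L6-v2), and ignores index, metadata, and SQLite overhead as well as skipped whitespace-only chunks. The `estimate_storage` helper is hypothetical.

```python
CHUNK_SIZE = 800      # characters per chunk (from the write path)
OVERLAP = 100         # characters shared between adjacent chunks
EMBED_DIM = 384       # all-MiniLM-L6-v2 output dimension
BYTES_PER_FLOAT = 4   # assumes float32 vectors


def estimate_storage(total_chars: int) -> dict[str, int]:
    """Estimate chunk count, stored characters, and raw vector bytes."""
    step = CHUNK_SIZE - OVERLAP
    starts = range(0, total_chars, step)
    # Each chunk holds up to CHUNK_SIZE characters; the last one may be shorter.
    text_chars = sum(min(CHUNK_SIZE, total_chars - s) for s in starts)
    return {
        "chunks": len(starts),
        "text_chars": text_chars,
        "vector_bytes": len(starts) * EMBED_DIM * BYTES_PER_FLOAT,
    }


estimate = estimate_storage(7000)
```

For 7,000 characters of conversation this gives 10 chunks, 7,900 stored characters (the overlap is stored twice), and 15,360 bytes of raw vectors. Both terms grow linearly with conversation volume, and under these assumptions the vectors take more space than the text they index.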

The AAAK compression that was marketed as "30x lossless" addresses storage growth but does so lossily. The 12.4 percentage point recall drop in AAAK mode is a meaningful quality cost. There is no currently documented path to compression that does not hurt recall.
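
Why truncation cannot be lossless is easy to show with a simplified stand-in. The two helpers below are assumptions that mirror only the 55-character sentence truncation and a split-based decode; the real AAAK also uses entity codes and keyword frequency, which are not modeled here.

```python
def truncate_sentences(text: str, limit: int = 55) -> str:
    """Simplified stand-in for AAAK: cut each sentence to `limit` characters."""
    return " | ".join(s[:limit] for s in text.split(". ") if s.strip())


def naive_decode(compressed: str) -> str:
    """String splitting only; characters dropped by truncation never come back."""
    return ". ".join(compressed.split(" | "))


original = (
    "Alice needs the auth migration to work with the new OAuth provider "
    "before the Monday demo. She is blocked on the PKCE flow."
)
round_trip = naive_decode(truncate_sentences(original))
```

The round trip only reproduces the input when every sentence is already 55 characters or shorter. In this example the first sentence is cut before "Monday demo", so a later query about the demo has nothing to match against in the compressed form.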

The palace hierarchy is a mental model, not a performance feature. Using wing/room filtering improves precision for queries where you know exactly which entity and topic to scope to. It reduces recall for ambiguous or cross-entity queries. For production use, the decision between scoped and unscoped retrieval requires understanding your query distribution, which MemPalace does not currently provide tooling to analyze.
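
One way to inform that decision is to log retrieval results for a labeled query set and compare recall in both modes. The sketch below uses one common definition of recall@k (the fraction of relevant IDs found in the top k), which may differ from LongMemEval's exact scoring; the helper names and the logged runs are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant must contain at least one ID")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical logged results for the same two queries in both modes.
unscoped_runs = [(["a1", "p1", "a2"], {"a1", "p1"}), (["a3", "a4"], {"a3"})]
scoped_runs = [(["a1", "a2"], {"a1", "p1"}), (["a3", "a4"], {"a3"})]

unscoped_score = mean_recall(unscoped_runs)
scoped_score = mean_recall(scoped_runs)
```

In this made-up log the scoped mode loses the cross-wing answer on the first query and matches the unscoped mode on the second. Running the same comparison over real queries would show what share of a workload is cross-entity, which is the number that should drive the choice.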

Technical Moats

Verbatim storage as a philosophical commitment. MemPalace is not the first system to store verbatim text. It is the first to make verbatim storage its explicit central design decision and to benchmark it systematically against extraction alternatives, at a time when Mem0 was at ~49% recall. Any team can reproduce the verbatim philosophy, but doing so means accepting the storage and context growth that extraction-based competitors avoid. The moat is the willingness to accept those tradeoffs, backed by benchmark validation.

The 33 MCP tools. The MCP integration covering palace reads/writes, knowledge graph operations, and tunnel navigation provides a usable interface for Claude Code users without writing custom integration code. This ecosystem integration work is not technically novel but creates real adoption friction for alternatives.

MemMachine's contextualized retrieval as the next design step. MemMachine's nucleus-plus-context expansion addresses the main recall failure mode in MemPalace: relevant evidence that spans multiple conversation turns. MemPalace returns isolated chunks. MemMachine returns those chunks plus surrounding episode context. The 93.0% accuracy on LongMemEvalS and ~80% token reduction vs Mem0 suggest that verbatim storage with contextualized retrieval is where the field is moving.

Insights

Insight One: The 96.6% benchmark number is both accurate and misleading, and both facts matter. It is accurate because the benchmark was run correctly on MemPalace's actual system. It is misleading because the performance is attributable to ChromaDB's all-MiniLM-L6-v2 embedding model applied to verbatim text, not to the palace spatial hierarchy. Any system that stores conversation text verbatim in ChromaDB with the default embedding model and runs nearest-neighbor search would score approximately the same on LongMemEval. The genuine MemPalace contribution is the decision to use verbatim storage and to validate that this beats extraction at scale. The palace metaphor is the organizational affordance, not the performance engine.

Insight Two: In April 2026 the AI memory landscape shifted in two opposite directions within two weeks. Mem0 published a token-efficient algorithm update that raised its LongMemEval score from approximately 49% to 93.4%, narrowing the gap to MemPalace's 96.6% to 3.2 percentage points. Over the same period, an independent audit of MemPalace confirmed that its palace structure regresses performance. Taken together, the correct read on the field is convergence: verbatim storage is ahead, and the gap between verbatim and smart extraction is closing rapidly as extraction systems improve their token efficiency. Teams choosing a memory architecture today should assume Mem0-class extraction systems will match verbatim recall within one or two more iterations.

Surprising Takeaway

MemPalace was built by Milla Jovovich and a systems engineer using Claude Code. It accumulated 47,000 GitHub stars in two weeks, triggered an independent arXiv analysis within days of launch, had its benchmark claims publicly corrected via GitHub issues, walked back the 100% headline in the README, and then shipped v3.1.0 with the corrections integrated, all within approximately three weeks of launch. The arc from viral launch to community audit to public correction to corrected release ran faster than the typical academic peer review cycle. The open-source AI project lifecycle now moves at a pace where a project's marketing claims are independently audited, reproduced, and formally critiqued on arXiv before the corrected v3.1.0 release ships. This is new, and it is a net positive for the field.

TL;DR For Engineers

MemPalace (MemPalace/mempalace, MIT, 47k+ stars, April 2026): verbatim storage with zero LLM calls at write time, ChromaDB + SQLite, ~170-token wake-up cost, 33 MCP tools. Two dependencies: chromadb, pyyaml. Runs fully locally.
The 96.6% LongMemEval Recall@5 (raw mode) is ChromaDB's default embedding model (all-MiniLM-L6-v2) performing nearest-neighbor search on uncompressed verbatim text. The palace hierarchy (Wings/Halls/Rooms/Closets/Drawers) is ChromaDB metadata filtering. Independent M2 Ultra replication confirmed the palace structure regresses retrieval when applied. The 100% hybrid score requires Haiku API calls at query time.
AAAK compression drops LongMemEval from 96.6% to 84.2% (-12.4pp). It is lossy despite "lossless" marketing. The decode() function cannot reconstruct original text. Do not use AAAK if recall accuracy matters.
Mem0 April 2026 update: 93.4% LongMemEval (from ~49%). Gap to MemPalace: 3.2pp. The extraction vs. verbatim gap is closing.
MemMachine (arXiv:2604.04853, April 2026): same verbatim philosophy, adds contextualized retrieval (nucleus + surrounding context), 93.0% LongMemEvalS accuracy, LoCoMo 0.9169 with gpt-4.1-mini, ~80% fewer input tokens vs Mem0.

The Verbatim Philosophy Is Correct. The Palace Is Optional.

MemPalace's lasting contribution is validating at scale that verbatim storage beats extraction-based competitors on recall, and that a zero-LLM write path with a low wake-up cost is deployable and fast enough for production use. Those are real contributions that the 47,000 stars reflect accurately.

The spatial palace metaphor is genuinely useful as an organizational concept for humans managing AI memory. It is not the performance driver. The performance driver is "store the text, search it well." MemMachine is the next iteration of the same insight, with contextualized retrieval addressing MemPalace's main failure mode. Both converge on the same thesis: write-time extraction is a premature optimization that sacrifices recall to save tokens.

In a world where tokens are cheap and recall errors are expensive, MemPalace made the correct bet.

References

Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture, arXiv:2604.21284, Dey and Viradecha, April 23 2026
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents, arXiv:2604.04853, Wang et al., April 6 2026
MemPalace GitHub Repository, MIT
lhl/agentic-memory MemPalace analysis (Issue #29 / ANALYSIS-mempalace.md) — the independent audit that identified the benchmark attribution issue
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, Wu et al., ICLR 2025 (the benchmark used for MemPalace's headline score)

MemPalace (MIT, 47,000+ GitHub stars in two weeks, April 2026) is a local-first AI memory system using verbatim storage with zero LLM calls at write time, ChromaDB + SQLite, and a five-level spatial hierarchy (Wings/Halls/Rooms/Closets/Drawers) that operates as metadata filtering in a single ChromaDB collection. Its 96.6% LongMemEval Recall@5 (raw mode) is attributable to ChromaDB's default all-MiniLM-L6-v2 embedding model on uncompressed verbatim text, not the palace hierarchy (which regresses performance in isolation per independent M2 Ultra replication). Genuine contributions: verbatim-first storage philosophy beating extraction-based competitors, ~170-token wake-up cost, fully deterministic zero-API write path, and first systematic spatial memory metaphor application. AAAK compression drops recall to 84.2% (-12.4pp). Mem0's April 2026 update closed to 93.4% LongMemEval. MemMachine (arXiv:2604.04853) achieves 93.0% accuracy on LongMemEvalS with contextualized retrieval and ~80% fewer input tokens than Mem0.
