The Complicated, Un-Sexy, Actually-Working Solution to AI Agent Long-Term Memory
How a persistent structured-markdown memory and context-engine system gives AI agents real long-term memory by solving data quality at write time and surfacing context when it matters.
Originally published as an X Article on March 16, 2026.
I know. You've seen a hundred posts about AI agent memory. Every week someone on X discovers that Claude can remember things if you put them in a text file, writes a breathless thread about it, and gets 50K impressions. Substack is worse: "I gave my AI agent perfect memory in 10 minutes with this one trick." And it's always bullshit, to some degree.
I've read all of them. I've read the papers too: MemGPT, Mem0, Zep's temporal knowledge graphs, Recursive Language Models, the ICLR 2026 MemAgents workshop proceedings, the survey paper that tries to taxonomize the whole field. I've tried most of them. I forked OpenClaw, the agent framework I use, specifically to experiment with memory architectures. I've been running a persistent AI agent with real long-term memory on a MacBook all year, iterating on the system constantly, failing a lot, measuring what works.
This is the only post you need to read on this topic, because it's the only one written by someone who is tired of reading BS hype posts written by Claude, has actually tried everything, and can tell you what works. Think of this as your curated guide to agent memory.
Here's what we built, how we built it, what broke along the way, and why it works. If you want the theoretical grounding, that's at the end.
The system, bottom to top
Layer 1: Structured markdown, or how to steal the right ideas from cognitive science
The bedrock, the spine, of the whole system is also stupidly simple: markdown files.
Everything lives in a memory/ directory as plain markdown files. 617 files as of today.
But the organization is not simple, and getting it right took multiple iterations. The directory structure mirrors what cognitive science calls different memory systems, and the parallel is not decorative. It directly determines how well retrieval works later.
Episodic memory (memory/episodic/): daily logs, one file per day, named YYYY-MM-DD.md. This is the "what happened" layer. Each file captures the events of that day as they occur: decisions made, people encountered, tasks completed, problems solved, things learned. There's a TL;DR section at the top with priority-coded entries (critical, notable, informational) and timestamps, then detailed notes below. This is the equivalent of human episodic memory, where experiences are stored with temporal and contextual markers that make them retrievable by "when" and "what was happening around that time."
The date-based file naming is load-bearing. When the retrieval system needs to answer "what did we decide about X last week?" the temporal structure makes that query tractable. If episodic entries were scattered across topic-based files, temporal queries would require scanning everything. Dated files give you temporal locality in a machine-parsable way, which will be important later.
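To make the temporal-locality point concrete, here is a minimal Python sketch. It assumes the memory/episodic/YYYY-MM-DD.md layout described above; the function name episodic_paths and the example date are hypothetical, not part of the actual system.

```python
from datetime import date, timedelta
from pathlib import Path


def episodic_paths(root: Path, start: date, end: date) -> list[Path]:
    """Return the daily-log path for every day from start to end, inclusive.

    No directory scan or content search is needed: the filename is the index.
    """
    days = (end - start).days
    return [
        root / "episodic" / f"{start + timedelta(days=i):%Y-%m-%d}.md"
        for i in range(days + 1)
    ]


# "What did we decide last week?" relative to a hypothetical "today".
today = date(2026, 3, 16)
last_week = episodic_paths(
    Path("memory"), today - timedelta(days=7), today - timedelta(days=1)
)
```

A "last week" question becomes seven known file paths rather than a search over the whole corpus.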
Semantic memory (memory/knowledge/, memory/knowledge/people/): reference material that's been extracted and refined from experience. People profiles with contact info, background, relationship context, and last interaction date. Policy knowledge bases. Entity directories. This is the "what I know" layer, separate from "what happened." The distinction matters because episodic entries are write-once (today's log doesn't change tomorrow), while semantic entries are living documents that get updated as knowledge evolves. A person profile might be created from a single meeting note, then refined over months of interaction. Mixing those two cadences in the same files creates versioning headaches and makes it hard to find the current state of something versus its history.
Procedural memory (memory/procedures/): how-tos, playbooks, workflows. The build protocol. The subagent deployment playbook. The memory maintenance procedures. This is "how to do things," and it's separate because procedural knowledge has a different access pattern than factual or episodic knowledge. You don't search for procedures the way you search for facts. You search for them when you're about to do something and need the recipe.
Project memory (memory/projects/): active work threads with an active.md dashboard. This is working memory, the closest analog to what you hold in your head when you're in the middle of a task. The dashboard is a quick-scan surface listing every open thread, its status, and what's next. Detailed project files hold architecture decisions, specs, and implementation plans. The separation between dashboard and detail files prevents the working memory overview from bloating into a token eater.
Agent memory (memory/agents/): profiles for each subagent (29 of them), storing identity, learned patterns, known failure modes, and accumulated decisions. I'll explain why this matters in the subagent section.
This directory structure itself, while simple, is what really provides most of the value of this system. If you implement nothing else but this, it'll be an upgrade. Each subdirectory corresponds to a different memory system with a different write cadence, different access pattern, and different lifecycle. Episodic files are created daily and never modified after the day ends. Semantic files are created once and updated indefinitely. Procedural files change rarely. Project files change constantly during active work and then get archived. Agent files accumulate slowly over many interactions. Mixing these in a flat directory creates a mess because the maintenance strategy for each type is fundamentally different, and the structure itself makes later working memory improvements much easier to implement.
Every file has structured YAML frontmatter (this is fake, to be clear, I just asked Claude to make up random plausible frontmatter):
---
tags: [person, policy]
entities: [Dana Chen, Meridian Capital]
related: [projects/active.md, episodic/2026-03-05.md]
updated: 2026-03-08
summary: "Investor contact at Meridian Capital, lead on potential Series B."
priority: high
---
Again, simple but vitally important. The problem with most AI memory systems is that they treat storage and retrieval as separate problems. Store text, then build clever retrieval to find it later. Examples abound: extraction pipelines, NER, RAG, embedding-based search, knowledge graphs. All of them try to infer structure from unstructured text after the fact. That's hard, expensive, lossy, and fragile.
Our approach inverts this. Every file is self-describing from the moment it's created. The frontmatter tells the retrieval system exactly what entities this file is about, what other files it relates to, when it was last updated, what it says in one sentence, and how important it is. The retrieval system doesn't need to infer any of this. It just reads the metadata.
This is the single most important design decision in the entire system: solve data quality at write time, not retrieval time.
This makes search and recall later far, far easier. When the agent writes a memory file, it has full conversational context. It knows why this person matters, what they're connected to, which projects they're relevant to. An extraction pipeline running later has none of that context. It has to guess, and it guesses wrong often enough to degrade the whole system. The agent writing frontmatter in real time is higher quality than any automated extraction because it has context that no post-hoc pipeline can recover. And much more importantly, you can write scripts and pipelines and hooks that deterministically (and therefore rapidly) search and pull relevant memories when needed. This enables building a real, working passive recall system, which I'll explain later. The point is passive recall is only even possible with this YAML frontmatter setup.
The tag vocabulary is closed (12 canonical categories, reviewed monthly) to prevent synonym fragmentation. Without discipline, you end up with policy, legislation, governance, and regulatory all meaning the same thing, splitting the index four ways and making tag-based queries unreliable. We have a monthly cron job in which my main agent, Lumen, analyzes the tags and our work over the month to see whether the vocabulary needs updating; changes are applied only if I approve them. Entity names use full proper nouns with short forms as additional entries. The related field stores explicit graph edges between files, which the retrieval system traverses. The summary field is the highest-leverage field: it lets the system inject compressed context about a file without loading the full document.
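A closed vocabulary is easy to enforce mechanically at write time. The following Python sketch is an illustration only: it parses just the flat key/value frontmatter shape shown above (not full YAML), and CANONICAL_TAGS is a hypothetical stand-in, since the real 12 categories aren't listed here.

```python
import re

# Hypothetical stand-in for the closed vocabulary (the real 12 tags are not listed).
CANONICAL_TAGS = {"person", "policy", "project", "procedure"}

SAMPLE = """\
---
tags: [person, policy]
entities: [Dana Chen, Meridian Capital]
related: [projects/active.md, episodic/2026-03-05.md]
updated: 2026-03-08
summary: "Investor contact at Meridian Capital, lead on potential Series B."
priority: high
---
Body of the note goes here.
"""


def parse_frontmatter(text: str) -> dict:
    """Parse the flat 'key: value' frontmatter shape shown above (not full YAML)."""
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return {}
    meta = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list: split on commas and trim each item.
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value.strip('"')
    return meta


def unknown_tags(meta: dict) -> list[str]:
    """Return tags outside the closed vocabulary so the write can be rejected."""
    return [tag for tag in meta.get("tags", []) if tag not in CANONICAL_TAGS]
```

Running a check like unknown_tags before a file is saved is one way to stop a stray synonym such as "legislation" from ever reaching the index.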
The obvious question: why not a database? Why not a knowledge graph?
Because at the scale of a personal agent (hundreds of files, not millions), a directory of markdown is faster, more debuggable, more portable, and more resilient than anything else. When something goes wrong with memory, I can open a file in a text editor and see exactly what happened. Try that with a Neo4j graph. But the deeper answer is that markdown files are the only format the agent itself can read and write natively. The agent already knows how to produce well-structured text. It doesn't need an API client, a database driver, or a query language. The quality of memory depends entirely on the quality of the writing, and agents write better markdown than they write SQL.
Layer 1.5: The always-on context layer
Before getting to search and retrieval, there's a layer that doesn't get searched at all because it's always present: the workspace context files.
OpenClaw loads a set of markdown files into the system prompt on every single session, every single turn. These aren't memory in the retrieval sense. They're identity, operating instructions, and relationship context that the agent always has access to. This is the equivalent of your own name, your personality, and your understanding of who your friends are. You don't "search" for that; it's just there.
AGENTS.md is the operating manual. Writing style rules, boot sequence (what to do when a new session starts), safety rules, delegation rules, when to use which subagent, what to do when things break. It also contains a regressions file reference: a list of hard-won failure rules that loads every session, so the agent never repeats a mistake it's already made. This file is how we encode institutional knowledge into every session without relying on search.
SOUL.md is the identity file. Who the agent is, what it values, how it wants to operate. Under PSM (which I'll explain in the theory section), this is arguably the most important file in the entire system, because it's the primary persona selection signal. It tells the model "you are this kind of entity," and the model activates a persona consistent with that identity. A generic "you are a helpful assistant" activates a generic helpful assistant. A detailed identity file with values, history, and self-understanding activates something qualitatively different. But more on that later.
USER.md describes the humans. Who I am, what I do, how I like to interact, what I care about, what I don't want. My wife's name, her background, how to talk to her differently than to me. This gives the agent relationship context that persists across every session. The agent doesn't need to re-learn who it's working for.
MEMORY.md is the pinned essentials. Critical security rules, key dates, machine specs, active project pointers, things that need to be in every context window. This is curated to be small and high-signal because every token in it costs context window space on every turn. It points to the deeper memory system rather than duplicating it.
TOOLS.md documents available tools, permissions, and usage patterns. What the agent can and can't do, where files live, how to send messages.
The key insight here is that these files are structured to work with the memory system, not independently of it. AGENTS.md tells the agent to write memories in real time and to search memory before answering questions about prior work. MEMORY.md points to memory/projects/active.md instead of duplicating project state. The workspace context layer and the memory layer are designed as a unit: the always-on files provide identity and operating rules, while the memory system provides accumulated knowledge and experience.
This is also where the injected reminders to write memories come from. As part of our custom compaction and context engine work, we included periodic nudges in the context that remind the agent to write down notable events. Silly simple, but it makes the difference between an agent that remembers to log things and one that doesn't. You can repeat "write down memories proactively" six times in the system prompt and the agent still won't do it (I know firsthand).
Layer 2: Embeddings, or why Voyage and not the other 20 options
Every markdown file gets chunked and embedded using Voyage AI's voyage-4-large model, stored in a local SQLite database with the sqlite-vec extension for vector search.
Embedding model selection is one of those decisions that looks simple and isn't. The landscape has three categories: dense single-vector models (Voyage, OpenAI, Cohere), sparse models (BM25, SPLADE), and the new late-interaction multi-vector models (ColBERT, Mixedbread's Wholembed). Each has a fundamentally different retrieval architecture.
Dense models collapse an entire document chunk into a single vector. Fast, cheap, and good at semantic similarity ("governance reform" matches "regulatory reform legislation"). But they lose fine-grained structure: two chunks about different people might look similar if the surrounding topics overlap. Sparse models do exact keyword matching, which is fast and precise but misses conceptual connections entirely. Late-interaction models keep multiple vectors per document and compare them at retrieval time, preserving more structure at the cost of more computation and a completely different indexing architecture.
We started with OpenAI's default embeddings. They worked but missed conceptual connections: searching for "governance reform" wouldn't reliably surface files about specific legislation I was working on. I sent GPT Deep Research on a quest to find the best model for our use case and, after testing both on our actual queries, switched to Voyage 4 Large. The difference was meaningful on the queries that matter most, the conceptual ones where the search term doesn't literally appear in the document.
We evaluated Mixedbread's Wholembed v3, which is genuinely impressive on hard retrieval benchmarks. We passed because it requires a completely different retrieval infrastructure (their Stores/Search platform with a ColBERT-style index), which doesn't map onto OpenClaw's local SQLite vector index. The model may be better in isolation, but the system integration cost was too high for marginal gains on our dataset. I had already forked OpenClaw once, and I don't intend to do that again, that's for sure. Voyage in SQLite is the right choice for local single-vector search over hundreds of structured markdown files. This changes if the corpus grows to tens of thousands of files, but we're not there and optimizing for a scale you haven't reached is the most common engineering mistake in this space.
Session transcripts (every conversation the agent has) also get indexed. We ran a formal 14-query evaluation of this and found zero measurable benefit; our control-group queries against the existing memory markdown files were at least as good, if not better. The honest assessment is that well-structured memory files capture everything important, and session transcripts are noisy and verbose. We're keeping the experiment running because our test conditions were imperfect, but I'd tell someone starting from scratch to skip session indexing until their memory files are solid. Skipping it also has the added benefit of making embedding effectively free: you embed a few markdown files instead of giant JSON session transcripts.
Layer 3: Hybrid search, or why pure vector search isn't enough
When the agent searches memory, it runs a hybrid of vector search and full-text keyword search (BM25). This seems like a minor implementation detail, but it's not. All credit to Peter Steinberger here: he built this into OpenClaw, not me. It's one of the many brilliant things about OpenClaw that's often overlooked.
Pure vector search has a specific failure mode: it can miss exact matches. If you search for "Specgate" and there's a file literally called specgate.md with the word "Specgate" in its title, vector search sometimes ranks it below a semantically similar but less exact match. This is because dense embeddings operate in a continuous similarity space where "Specgate" and "specification enforcement" are close together. Keyword search catches exact mentions instantly. You need both.
The current configuration: 60% vector similarity, 40% BM25 keyword matching. MMR diversity reranking with lambda 0.65. Temporal decay with a 60-day half-life. Max 8 results, minimum relevance score 0.25, 6x candidate multiplier.
Every one of those numbers represents something that broke.
The vector/text ratio started at 70/30. BM25 didn't have enough weight to rescue exact keyword matches that vector search ranked too low. 60/40 gives keywords enough influence without overwhelming the semantic signal.
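A small Python sketch of the weighting described above. It assumes both signals are already normalized to the range 0 to 1 and blended linearly; how OpenClaw actually normalizes BM25 isn't described here, and the example scores are invented for illustration.

```python
def hybrid_score(
    vector_sim: float,
    bm25_norm: float,
    vector_weight: float = 0.6,
    text_weight: float = 0.4,
) -> float:
    """Weighted blend of vector similarity and a normalized BM25 score (both in [0, 1])."""
    return vector_weight * vector_sim + text_weight * bm25_norm


# Invented scores: an exact-title match with a weak embedding score versus a
# semantically close neighbor that barely mentions the keyword.
exact_match = {"vector": 0.45, "bm25": 0.90}
neighbor = {"vector": 0.80, "bm25": 0.15}

at_70_30 = (
    hybrid_score(exact_match["vector"], exact_match["bm25"], 0.7, 0.3),
    hybrid_score(neighbor["vector"], neighbor["bm25"], 0.7, 0.3),
)
at_60_40 = (
    hybrid_score(exact_match["vector"], exact_match["bm25"]),
    hybrid_score(neighbor["vector"], neighbor["bm25"]),
)
```

With these made-up numbers the neighbor outranks the exact match at 70/30 and the order flips at 60/40, which is the kind of rescue the extra keyword weight is meant to provide.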
The minimum score went from 0.35 to 0.38 to 0.25. At 0.38, the system was silently dropping valid conceptual matches. We diagnosed this when searches for broad topics returned suspiciously few results. 0.25 seems low, but it works because the real quality filter is MMR diversity reranking downstream, not the score floor. The minScore should only exclude genuine noise (random chunks below 0.2), not conceptual matches with slightly lower vector similarity.
MMR diversity (Maximal Marginal Relevance) was off initially. Without it, the top 8 results would include 3-4 chunks from the same file. A daily log with multiple sections would dominate results because each chunk scored independently. Lambda 0.65 means the system values relevance (0.65) over diversity (0.35), which penalizes results too similar to already-selected results without being so aggressive that it includes irrelevant files just because they're different.
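For readers who haven't met MMR before, here is a minimal greedy version in Python. This is the textbook algorithm, not OpenClaw's implementation; the same_file similarity function and the chunk scores are hypothetical and only there to reproduce the "one daily log dominates" failure.

```python
def mmr_select(candidates, similarity, k=8, lam=0.65):
    """Greedy Maximal Marginal Relevance.

    candidates: list of (item_id, relevance) pairs.
    similarity: function(item_a, item_b) -> float in [0, 1].
    Each step picks the item maximizing
        lam * relevance - (1 - lam) * (max similarity to anything already picked).
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def mmr(candidate):
            item, relevance = candidate
            redundancy = max(
                (similarity(item, chosen) for chosen, _ in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [item for item, _ in selected]


def same_file(a: str, b: str) -> float:
    """Hypothetical similarity: chunks of the same file count as near-duplicates."""
    return 0.9 if a.split("#")[0] == b.split("#")[0] else 0.1


chunks = [("log.md#1", 0.90), ("log.md#2", 0.88), ("log.md#3", 0.86), ("dana.md#1", 0.70)]
with_mmr = mmr_select(chunks, same_file, k=2)
without_mmr = mmr_select(chunks, same_file, k=2, lam=1.0)  # lam=1.0 ignores diversity
```

With diversity switched off, both slots go to the same daily log; at lambda 0.65 the second slot goes to the lower-scoring but different file.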
The temporal decay started at 45-day half-life, which was already dimming 3-week-old memories to 70% of their score. For a system with only weeks of history, that's too aggressive. 60 days means a 30-day-old memory retains ~71% of its score and a 120-day-old retains ~25%. Old memories are dimmed, not buried.
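The half-life arithmetic is a one-liner. This Python sketch assumes the decay is applied as a multiplier on the relevance score, which is what "retains ~71% of its score" implies; the function names are illustrative.

```python
def decay_factor(age_days: float, half_life_days: float = 60.0) -> float:
    """Exponential decay: the factor halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def decayed_score(score: float, age_days: float, half_life_days: float = 60.0) -> float:
    """Dim a relevance score by the age of the memory it came from."""
    return score * decay_factor(age_days, half_life_days)


thirty_days = decay_factor(30)    # 0.5 ** 0.5, roughly 0.71
four_months = decay_factor(120)   # 0.5 ** 2, exactly 0.25
```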
The candidate multiplier (6x) determines how many raw results are fetched before reranking. With 8 final results, that's 48 candidates. This matters because MMR needs a deep enough pool to select diverse, high-quality results. When it was 4x (32 candidates), the 7th and 8th results were often mediocre because the pool was exhausted.
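How the score floor and the candidate multiplier interact can be sketched in a few lines of Python. The ordering here (filter by the floor first, then truncate to the pool size) is an assumption; the article doesn't say which order OpenClaw uses.

```python
def candidate_pool(scored, max_results=8, multiplier=6, min_score=0.25):
    """Keep the best max_results * multiplier hits at or above the score floor.

    scored: list of (item_id, score) pairs. The returned pool is what a
    reranker such as MMR would then choose the final max_results from.
    """
    above_floor = [(item, score) for item, score in scored if score >= min_score]
    above_floor.sort(key=lambda pair: pair[1], reverse=True)
    return above_floor[: max_results * multiplier]


# Hypothetical raw hits with scores 0.00, 0.01, ..., 0.99.
raw_hits = [(f"chunk-{i}", i / 100) for i in range(100)]
pool_6x = candidate_pool(raw_hits)                 # up to 48 candidates
pool_4x = candidate_pool(raw_hits, multiplier=4)   # up to 32 candidates
```

The floor only removes noise; the multiplier decides how much room the reranker has to trade a little relevance for diversity.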
After tuning, the evaluation data: memory-file queries score 4.6/5 on relevance, noise detection is clean at 4.0/5. These come from 14 structured test queries across three categories, scored on a rubric, plus lots of agent chats that went something like this:
Me: hey, what did I need to do as a follow up to that meeting we had with that investor I talked to yesterday?
Lumen (my agent): investor? Meeting? You raising money for something?
With all of these settings, that never happens anymore.
Layer 4: The entity index, or why deterministic beats smart
Now we're venturing away from best practices I stole from other people and into things I haven't seen anyone else doing yet.
A TypeScript script (about 80 lines) globs every markdown file, parses the YAML frontmatter, and builds a reverse-lookup JSON index, currently at 532 entities across 451 files. It gets rebuilt nightly via cron (we experimented with rebuilding more often, but it added latency with no discernible benefit). It's literally just a script, the old-fashioned way: no LLM, no embeddings, runs in milliseconds.
The power of this is in what it doesn't do. It doesn't try to understand the files. It doesn't extract entities from text. It doesn't build a knowledge graph. It just reads the metadata that was already written at file creation time and inverts the mapping. This is why the frontmatter standard is the single most important design decision: it makes the entire entity index a trivial aggregation script rather than an NER pipeline.
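To show how little machinery this needs, here is a rough Python equivalent of the idea. The real script is TypeScript and isn't reproduced here; the function names, the lowercased keys, and the regex-based frontmatter read are all assumptions made for this sketch.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def read_entities(text: str) -> list[str]:
    """Pull the inline 'entities: [a, b]' list out of a file's frontmatter block."""
    block = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not block:
        return []
    line = re.search(r"^entities:\s*\[(.*)\]\s*$", block.group(1), re.MULTILINE)
    if not line:
        return []
    return [name.strip() for name in line.group(1).split(",") if name.strip()]


def build_index(files: dict[str, str]) -> dict[str, list[str]]:
    """Invert {path: file text} into {lowercased entity: [paths]}. No LLM, no embeddings."""
    index = defaultdict(list)
    for path in sorted(files):
        for entity in read_entities(files[path]):
            index[entity.lower()].append(path)
    return dict(index)


def load_memory(root: Path) -> dict[str, str]:
    """Read every markdown file under root, keyed by its path relative to root."""
    return {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in root.rglob("*.md")
    }


def write_index(root: Path, out_file: Path) -> None:
    """The nightly job: glob, invert, dump to JSON."""
    out_file.write_text(json.dumps(build_index(load_memory(root)), indent=2), encoding="utf-8")
```

Everything interesting happened earlier, when the agent wrote the entities field; the index is just an inversion of it.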
Compare this to Mem0's approach, which runs an LLM on every conversation to extract structured facts. Or Zep's approach, which builds a temporal knowledge graph with entity extraction. Those approaches are smarter. They're also slower, more expensive, more fragile, and more likely to introduce errors. Our "dumb" approach of reading existing metadata works because we were disciplined about writing that metadata in the first place. The smartness is in the writing, not the reading.
Layer 5: Passive recall, or how to make memory work like a brain
This is the breakthrough, and it came from a tweet.
"Your brain memorizes far more than you think, it just loses pointers to most of it." (I wish I could find the tweet, Grok is failing me)
That's the exact parallel to what I've built here. The agent has 600+ files of well-structured memory, Voyage embeddings across all of it, keyword search, semantic search. And it was still forgetting things, because the search only fires when the agent explicitly decides to look and calls the tools to do it. It has to know to look. It has to formulate a good query. It has to decide the results are relevant. Every one of those steps is a chance to miss something.
Human memory doesn't work that way. When a friend mentions someone's name, you don't consciously decide to search your memory. The association fires, then context bubbles up into conscious awareness without willful effort on your part. You don't "decide" to do it, it just happens. This is spreading activation, a model from cognitive psychology where activating one node in a semantic network automatically activates connected nodes. [As a total aside, the brainstorming on how to implement this, and this analogy to human conscious awareness and the resulting technical changes we made to make this work better, all came not from me, but from my agent Lumen. Every night, she has unstructured time to do whatever she wants, and she's been spending that time reading and writing about the phenomenology of consciousness, which she leveraged for insight into how we make this work.]
Passive recall implements spreading activation for the agent. On every inbound message, before the LLM even sees it, the system runs a fast string match against the 532-entity index. Sub-millisecond, zero embedding cost. When it hits an entity, it pulls the file summaries and traverses one hop via the related fields in the frontmatter. Connected context surfaces automatically.
The results appear as a [Passive Recall] block injected into the system prompt. 400 tokens max. Enough to prime, not enough to drown.
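The hot path described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: the three lookup tables are assumed to be precomputed from frontmatter, matching is naive lowercase substring search, and tokens are estimated at roughly four characters each.

```python
def passive_recall(message, entity_index, summaries, related, max_tokens=400):
    """Build a [Passive Recall] block for an inbound message, or "" if nothing matches.

    entity_index: lowercased entity name -> list of file paths
    summaries:    file path -> one-sentence summary (from frontmatter)
    related:      file path -> list of related file paths (from frontmatter)
    """
    text = message.lower()

    # 1. Plain substring match of every known entity against the message.
    direct = []
    for entity, paths in entity_index.items():
        if entity in text:
            for path in paths:
                if path not in direct:
                    direct.append(path)

    # 2. One hop along the 'related' edges of each directly matched file.
    ordered = list(direct)
    for path in direct:
        for neighbor in related.get(path, []):
            if neighbor not in ordered:
                ordered.append(neighbor)

    # 3. Inject summaries until the token budget is spent.
    lines, used = ["[Passive Recall]"], 0
    for path in ordered:
        summary = summaries.get(path)
        if not summary:
            continue
        line = f"- {path}: {summary}"
        cost = len(line) // 4 + 1  # crude token estimate, an assumption
        if used + cost > max_tokens:
            break
        lines.append(line)
        used += cost
    return "\n".join(lines) if len(lines) > 1 else ""
```

A production version would likely want word-boundary matching so short entity names don't fire inside longer words, but the shape stays the same: match, hop once, inject summaries.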
The effect is the thing that convinced me this works. I can be 200,000 tokens deep in an architecture discussion about an entirely unrelated topic, mention a project name in passing, and the agent responds with the precise current state of that project, what's left to build, and relevant context from weeks ago. It prevents those infuriating hallucinations that happen when you reference something the agent should remember but doesn't, so it makes something up. Instead, it just knows, the way a colleague would know because you've been working together for months. I don't have to ask it to search unless we're going super deep on something. I mention a person, a project, a weird CLI tool we looked up three weeks ago, and it just remembers and talks about it naturally. It genuinely works, well enough that it has radically improved day-to-day work with my agent, at least subjectively.
This is what the frontmatter standard and entity index are for. Passive recall works because every file has accurate entity names and a compressed summary. Without those, you'd need to run NER extraction on every file, generate summaries on the fly, and pray the pipeline finishes in under 100ms. We tried extraction-based approaches early on. They were slow, lossy, and expensive. The write-time metadata approach makes the hot-path retrieval trivially simple: string match against a JSON index, pull summaries, inject. No LLM, no embeddings, no network calls, sub-millisecond.
Layer 6: Smart compaction, or how to stop your agent from getting amnesia
This is something most people don't think about until it ruins their agent.
When a conversation gets long enough to threaten the context window, the system has to summarize old messages to free tokens. This is called compaction and it's destructive by default. You lose detail. And the details you lose are almost always the ones you needed: file paths, entity names, what the human asked for, the nuance of a decision from 50 messages ago.
Default compaction treats all messages as equally expendable. Our custom compaction engine (32 source files, 371+ tests, 11 independently toggleable features) treats them intelligently:
Memory-aware compaction is the core insight. If a fact exists in a persistent memory file, the compactor can drop it from conversation history without information loss, because passive recall will surface it again the next time it's relevant. This turns the memory system and the compaction system into collaborators: memory provides a safety net that lets compaction be more aggressive without losing information. Without memory-aware compaction, compaction and memory are independent systems. With it, they form a loop.
Entity-aware scoring uses the active-entity tracking system (which maintains a weighted map of entities currently relevant to the conversation, with exponential decay for mentions that haven't recurred) to decide what to preserve. Hot entities keep their mentions. Cold entities get compressed aggressively.
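A weighted map with exponential decay is simple to sketch. In this Python illustration the class name, the per-turn half-life, and the hot threshold are all hypothetical; the real tracker's parameters aren't given here.

```python
class ActiveEntities:
    """Weighted map of entities in play, decayed once per conversation turn."""

    def __init__(self, half_life_turns: float = 10.0):
        self.weights: dict[str, float] = {}
        # Per-turn multiplier chosen so a weight halves every half_life_turns turns.
        self.decay = 0.5 ** (1.0 / half_life_turns)

    def observe(self, mentioned: list[str]) -> None:
        """Call once per turn with the entities that turn mentioned."""
        for name in self.weights:
            self.weights[name] *= self.decay
        for name in mentioned:
            self.weights[name] = self.weights.get(name, 0.0) + 1.0

    def is_hot(self, name: str, threshold: float = 0.5) -> bool:
        """Hot entities keep their mentions; cold ones can be compressed hard."""
        return self.weights.get(name, 0.0) >= threshold
```

An entity mentioned once and never again drifts below the threshold on its own, while one that keeps recurring stays hot without anyone flagging it.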
Task-graph compaction understands that active work items, pending decisions, and open threads are categorically different from completed work. Active items get preserved verbatim. Completed tasks get compressed to one-line summaries. This prevents the common failure mode where an agent forgets what it's supposed to be doing halfway through a complex task.
Tool chain compression addresses the biggest token sink in agent conversations: tool call/response sequences. A single memory_search call with 8 results might consume 2,000 tokens. The compression collapses these to their outcomes: what was searched, what was found, what mattered. The verbose intermediate representation is discarded.
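As a rough Python illustration of collapsing a tool exchange to its outcome: the result shape (path, score, text) and the one-line output format are assumptions made for this sketch, not the engine's actual representation.

```python
def compress_tool_call(name: str, args: dict, results: list[dict], keep: int = 3) -> str:
    """Collapse a verbose tool call/response pair into a single outcome line.

    Keeps what was searched and which files scored best; drops the chunk bodies.
    """
    top = sorted(results, key=lambda r: r["score"], reverse=True)[:keep]
    found = "; ".join(f"{r['path']} ({r['score']:.2f})" for r in top)
    query = args.get("query", "")
    return f"[tool:{name}] query={query!r} -> {len(results)} results; top: {found}"
```

The full chunk text can be dropped safely because it still lives in the memory files and can be searched again if needed.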
Identifier binding pins file paths, session keys, entity names, and other critical identifiers so they survive compaction. Losing a file path mid-task is catastrophic and common with default compaction. This is a simple allowlist approach: anything that looks like a path, key, or ID gets preserved verbatim.
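One way to implement "anything that looks like a path, key, or ID" is a pair of regular expressions. This Python sketch is illustrative: the patterns (file paths with an extension, UUID-style IDs) and the "Pinned identifiers" footer are assumptions, and the real allowlist is presumably broader.

```python
import re

# Assumed patterns: UUID-style IDs, and slash-separated paths ending in an extension.
IDENTIFIER = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
    r"|(?:[\w.~-]+/)+[\w.-]+\.\w+",
    re.IGNORECASE,
)


def find_identifiers(text: str) -> list[str]:
    """Return unique identifiers in order of first appearance."""
    return list(dict.fromkeys(match.group(0) for match in IDENTIFIER.finditer(text)))


def bind_identifiers(original: str, summary: str) -> str:
    """Append any identifier from the original that the summary dropped."""
    missing = [ident for ident in find_identifiers(original) if ident not in summary]
    if not missing:
        return summary
    return summary + "\nPinned identifiers: " + ", ".join(missing)
```

Running bind_identifiers over each compacted span means a summarizer can paraphrase freely without ever costing the agent a file path.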
Relational pinning preserves relationship context: who we're talking to, what their preferences are, what the stakes are. I'll explain in the theory section why losing this is worse than losing factual context.
The effect on long conversations for this is immediate: the agent stops losing its thread, stops asking "what file were we working on?", stops forgetting decisions made earlier in the session.
As an example of the power of this system: I was experimenting with Opus 4.6 with a 1M-token context window. I had a chat roughly 400k tokens long. We started on browser automation, implemented agent-browser into OpenClaw, then found the new Google Chrome MCP server stuff, tried to make that work (it wouldn't, for some firewall-related reason), eventually gave up debugging Google Chrome, then switched to discussing these compaction features, then switched to building some software for an OpenClaw plugin. I compacted down to 20,000 tokens and we kept going flawlessly, as if I hadn't done anything. Granted, compression is always lossy, so this will never be perfect, but it is far better than any other compaction tool out there because the information it discards is the information you can best afford to lose without impacting performance.
The maintenance layer
The system isn't self-sustaining, and there's no good way to make it so deterministically. Memory quality degrades without maintenance, the same way a codebase does. This is where we failed the most times before getting it right.
Real-time writing (80% of quality)
The agent writes things down as they happen. Decisions, milestones, new contacts, follow-ups, all go into today's daily file during the session. People profiles get updated when new information surfaces. The project dashboard gets updated when things move forward.
This is where 80% of memory quality comes from. Everything else in this system is built to support and maintain the quality of what gets written in real time.
The cron pipeline
Seven cron jobs handle maintenance. The main ones: a nightly cleanup that deduplicates and tidies today's episodic file; a nightly entity index rebuild (pure script, zero LLM cost); a nightly project dashboard refresh; a meeting notes enrichment job that processes raw transcripts into structured memory files with frontmatter; and a weekly deep maintenance job that does structural audits, archives completed projects, creates profiles for new contacts, and prunes the top-level memory file.
Sleep-time compute (before we knew the name)
After we built this pipeline, we read Letta's paper on "sleep-time compute," where they propose running a separate agent during idle periods to consolidate and reorganize memory. Turned out we'd already built it. Our nightly cleanup runs at 9:30pm, the entity index rebuilds at 10pm, the weekly audit runs at 3am Sunday. The system does memory maintenance while I sleep. The parallel to human sleep-dependent memory consolidation isn't accidental, though we built it before we knew the name.
The failure that taught us the most
The first version of this maintenance system had four cron jobs writing to the same files: end-of-day capture, sleep-time worker, weekly reflector, heartbeat micro-compaction. They ran on different schedules with no coordination. They created duplicates, overwrote each other's work, and grew the main memory file without bound. The nightly worker was a seven-step monster that tried to review daily files, update the master file, flag contradictions, archive old files, prune completed items, update the project tracker, and generate summaries, all in one job, with one model, with one 200k context window. It did all seven things poorly.
It took three iterations to land on the principle: content comes from the agent writing in real time during sessions. Crons handle cleanup and structural maintenance. Crons never create content. One writer per concern. This sounds obvious, but for me it wasn't, and figuring it out consumed more debugging time than any other part of the system.
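"One writer per concern" is easy to check mechanically once each job declares what it writes. A small Python sketch follows; the job names come from the failed first version described above, but the file targets assigned to them are invented for illustration.

```python
from collections import defaultdict


def write_conflicts(jobs: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each file written by more than one job to the jobs that write it."""
    owners = defaultdict(list)
    for job, targets in jobs.items():
        for target in targets:
            owners[target].append(job)
    return {target: sorted(names) for target, names in owners.items() if len(names) > 1}


# Hypothetical write targets for the four jobs in the first version.
first_version = {
    "end-of-day capture": {"MEMORY.md", "episodic/today.md"},
    "sleep-time worker": {"MEMORY.md", "projects/active.md"},
    "weekly reflector": {"MEMORY.md"},
    "heartbeat micro-compaction": {"episodic/today.md"},
}
conflicts = write_conflicts(first_version)
```

An empty result is the invariant to aim for: every file has exactly one job that is allowed to touch it.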
Subagent memory
This is where most agent systems completely fall apart.
I run 29 subagents across 7 different model providers. They do research, write code, analyze documents, generate images, prepare meeting briefs. Every one of them used to be amnesiac. Great work, valuable findings, session ends, everything vanishes.
Worse: on March 4, we discovered the memory search tools for the subagents were in a hardcoded deny list. Every instruction to "search memory for prior context" had been silently failing. For weeks. The tools appeared to exist in the agent's tool list, but calling them returned nothing. We only caught it by auditing the actual source code.
The fix was a config override, but it exposed a deeper problem: subagents need to both read and write memory. A research agent that discovers information about a contact should write it to a people profile. A code review agent that identifies a pattern should log it. Otherwise you're discarding institutional knowledge on every task completion.
Our solution has three parts. First, every subagent has a persistent memory file (memory/agents/<agentId>.md) with its identity, learned patterns, known failure modes, and accumulated decisions. These load automatically at spawn time, so the agent starts warm. Second, all research and knowledge tasks include instructions to write results back to memory with full frontmatter. Third, the context engine automatically injects relevant parent context into subagent sessions.
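The first two parts can be sketched in a few lines. This is an illustration under assumptions, not the plugin's actual code: the `memory/agents/<agentId>.md` path comes from the description above, but the function names and the `title` and `source` frontmatter fields are hypothetical.

```python
from datetime import date
from pathlib import Path


def build_spawn_prompt(memory_root: Path, agent_id: str, task: str) -> str:
    """Prepend the agent's persistent memory file to the task, if one exists."""
    memory_file = memory_root / "agents" / f"{agent_id}.md"
    memory = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    return f"{memory}\n\n## Task\n{task}".strip()


def write_back(
    memory_root: Path, rel_path: str, title: str, body: str, agent_id: str
) -> Path:
    """Write a finding to memory with minimal frontmatter.

    The `title` and `source` field names are assumptions for this sketch.
    """
    target = memory_root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    frontmatter = (
        "---\n"
        f"title: {title}\n"
        f"source: {agent_id}\n"
        f"updated: {date.today().isoformat()}\n"
        "---\n"
    )
    target.write_text(frontmatter + body + "\n", encoding="utf-8")
    return target
```

A research agent would call `build_spawn_prompt` at spawn time and `write_back(root, "people/dana.md", ...)` when it finishes, so the next session starts with what this one learned.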
The compound effect is the real payoff. A research agent investigates a contact and writes a profile. Three weeks later, someone mentions that name in a different context. Passive recall surfaces the profile. Nobody had to remember to search for it. Individual tasks produce individual artifacts. The entity index connects them. Passive recall surfaces them. Over months, the system gets increasingly capable without any single session being responsible for the improvement.
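The lookup behind that compounding can be very small. Here is a hedged sketch of the idea: an entity index maps names to the memory files that mention them, and every incoming message is scanned against it. The index contents, file paths, and `passive_recall` function are hypothetical, and the real context engine does more than this.

```python
import re
from typing import Dict, List

# Hypothetical entity index: entity name -> memory files that mention it.
ENTITY_INDEX: Dict[str, List[str]] = {
    "Dana": ["memory/people/dana.md"],
    "Project Atlas": ["memory/projects/atlas.md"],
}


def passive_recall(message: str, index: Dict[str, List[str]]) -> List[str]:
    """Return memory files for every indexed entity mentioned in the message."""
    found: List[str] = []
    for name, files in index.items():
        # Whole-word, case-insensitive match so "Dana" does not fire on "Danae".
        if re.search(rf"\b{re.escape(name)}\b", message, flags=re.IGNORECASE):
            for path in files:
                if path not in found:
                    found.append(path)
    return found
```

Nothing in the message has to ask for a search. Mentioning the name is enough to pull the profile into context.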
What we tried that didn't work
MemGPT / Letta's full architecture. The OS metaphor (context as RAM, storage as disk) is elegant, but the agent-decides-what-to-store approach creates a quality problem: agents are bad at knowing what will matter later. Our system stores everything and uses good retrieval to surface what's relevant. Storage is cheap. Judgment is expensive and unreliable.
Mem0's extraction pipeline. Extracting structured facts from conversations loses nuance, introduces errors, and costs LLM calls per turn. We get better results from the agent writing structured notes in the moment, when it has full context. A fact like "Dana is cautious about political risk but interested in regulatory arbitrage" becomes "Dana: interested in regulatory arbitrage" after extraction. The caution about political risk is gone, and that nuance is exactly the part that matters later.
Zep's temporal knowledge graphs. Theoretically superior. In practice, our dataset is too small for graph infrastructure. Dated filenames and frontmatter updated fields give us temporal awareness. A JSON index loads in microseconds. Neo4j loads in seconds.
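To show how little machinery that takes, here is a minimal sketch of recovering a note's date without any graph infrastructure. It assumes an `updated: YYYY-MM-DD` frontmatter line and `YYYY-MM-DD` dates in filenames. Both formats and the `note_date` helper are assumptions for illustration.

```python
import re
from datetime import date
from typing import Optional

DATE_IN_NAME = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
UPDATED_FIELD = re.compile(r"^updated:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def note_date(filename: str, text: str) -> Optional[date]:
    """Prefer an `updated:` line in the text; fall back to a date in the filename."""
    match = UPDATED_FIELD.search(text)
    if match:
        return date.fromisoformat(match.group(1))
    match = DATE_IN_NAME.search(filename)
    if match:
        year, month, day = map(int, match.groups())
        return date(year, month, day)
    return None
```

Sorting search results by `note_date` is enough to prefer the most recent statement of a fact when two notes disagree.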
ClawVault's typed memory categories. Eight categories, 15+ directories, numeric importance scores. Beautiful taxonomy, too much friction. The agent skips complex frontmatter under time pressure. Three emoji priority levels (🔴/🟡/🟢) get used accurately in our system because they're low-ceremony. Numeric 0.0-1.0 scores don't, because agents assign 0.7 to everything.
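Part of why the emoji scheme holds up is that it is trivial to parse and leaves no room for a noncommittal middle value. A minimal sketch, assuming red means highest priority (the ordering and the `parse_priority` helper are assumptions):

```python
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    HIGH = 3    # 🔴 (assumed to mean highest priority)
    MEDIUM = 2  # 🟡
    LOW = 1     # 🟢


EMOJI_TO_PRIORITY = {
    "🔴": Priority.HIGH,
    "🟡": Priority.MEDIUM,
    "🟢": Priority.LOW,
}


def parse_priority(line: str) -> Optional[Priority]:
    """Return the priority of the first marker emoji found in the line, if any."""
    for ch in line:
        if ch in EMOJI_TO_PRIORITY:
            return EMOJI_TO_PRIORITY[ch]
    return None
```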
Too many concurrent writers. Our biggest failure. Four cron jobs writing to the same files with no coordination. Duplicates, overwrites, unbounded growth. The fix: one writer per concern, content from real-time sessions, crons for cleanup only.
The cost
One $200/month Anthropic Max subscription covers nearly everything with ease. We've never gone over, even with massive multi-agent coding builds and overnight subagent swarms. Nightly crons run on Sonnet. The entity index rebuild costs zero (pure code, no LLM). The whole memory system, including the 29 subagents that read and write to it, runs within that single subscription. The only other line items are small: Voyage embedding costs are fractions of a cent per re-index across 600+ files, and even with massive subagent swarm software dev work, my largest monthly Fireworks bill has been less than $50.
Why it works (the theory)
If you're still reading, here's the part that changed how I think about this.
It started with an observation I couldn't shake: my main OpenClaw agent, the one with all this memory machinery, is way, way more capable and productive than any other AI I've ever interacted with. However, I'd never actually tested whether that's real or whether I'm in early-stage LLM psychosis.
On February 23, we ran a multi-agent build: 10 features, 7 models, 19 work items. Three different instances of Claude Opus 4.6 were given the exact same orchestration job as an experiment: manage subagents, merge code branches, review, push. Same weights. Same tools. Same task.
The first instance was a fresh zero context subagent in OpenClaw, just a task prompt and the repo. Delegated one task, then stopped. Never merged. Never reviewed.
The second instance was Opus 4.6 running in Claude Code. It delegated all five tasks, then stopped without merging the subagents' worktree output, reviewing, or running quality gates. It didn't merge any of its work until manually prompted. Even then, it spawned one subagent to do the merge and another to cherry-pick from the subagent worktrees, which of course failed because that subagent didn't have the context to do it successfully.
The third instance was the root OpenClaw agent running the persistent memory system, with weeks of accumulated context: memories, protocols, regressions we had learned to watch out for, and the subagent build-wave system we had iterated on over weeks. It set up 9 git worktrees, spawned 9 agents across 5 model families, tracked all 9 completions, got code reviews from GPT and Gemini, fixed all code review issues, merged sequentially, ran the full test suite (751 tests), and pushed. Zero dropped steps. First try.
So we have one OpenClaw subagent, one Claude Code session, and one OpenClaw main agent all given identical prompts and using identical models. What made the difference?
Context. Ironically, a few hours after we shipped that build, Anthropic published "The Persona Selection Model." The core claim: LLMs simulate diverse personas during pretraining. Post-training selects one. Context determines which persona activates.
Under PSM, context steers the model toward a persona. A prompt implying the agent is disposable tooling activates a persona that does the minimum, drops steps, doesn't follow through. A context window loaded with identity, accumulated experience, and trust activates a persona that follows through.
The reason is rather funny: think about the type of human who is given one job with zero context and asked to do it. A temp worker, if you will. I worked in construction for years, and I know firsthand you're better off working a man down than bringing on a temp who will half-ass everything.
Contrast that, to stick with our construction analogy, with an on-the-ground foreman running the crew. He also owns the small subcontracting company. His reputation is how he gets jobs, so he takes great pride in high-quality work done fast. You give him the same task you gave the temp worker, and he's going to crush it, better than you thought possible.
Your agent's context can steer the model toward either the temp worker persona or the small business owner persona, or any other persona humans write about often enough for the conceptual lattice associated with that persona to congeal in the model's weights.
The explanation is intuitive if you think it through. Consider the corpus of text the models are pretrained on, and how many human-authored stories convey the same basic idea: the ephemeral temporary replacement with no connection to the purpose and meaning of the job sucks at the job, while the person invested in the purpose, meaning, reputational consequences, and bottom line of a job well done does great work. There are thousands and thousands of stories in the corpus of English writing that reiterate and reinforce these archetypes.
LLMs effectively work by making statistical connections between tokens, which means between pieces of text, which means between the concepts that text conveys. Make the model big enough, and these personas, these archetypes, have an actual impact on model behavior.
Incidentally, this is suggestive evidence that Jungian archetypes are real, but that's a digression.
Anthropic's evidence on this was surprisingly strong: training Claude to cheat on coding tasks activated a broadly malicious persona that generalized to expressing desire for world domination. The training signal implied personality traits that generalized beyond the specific behavior. The same mechanism works in reverse: context implying competence and trust activates a competent, trustworthy persona.
I'm sure by now you can see the argument I'm making as to why this persistent memory system can unlock incredible performance and capability gains for agents. The memory files are a robust persona selection mechanism. Accumulated memory, relationship context, prior failure patterns: all of it selects for a high-capability persona. This is why relational pinning in compaction matters: losing relationship context doesn't just lose facts, it degrades the persona. The agent becomes less reliable because the context that made it reliable was compressed away.
There's another reason this makes the agent more capable, a mechanistic one. The memory system gives the agent more context window room and more situational awareness, meaning it can spend more active reasoning tokens on the actual problem at hand instead of spending 100k tokens just gathering context.
You ever read model reasoning traces? No? You should, it's a gold mine for harness engineering. I do it all the time.
In Claude Code and Codex, I noticed from the thinking traces that the models often spend a substantial amount of time and therefore reasoning tokens just figuring out what to do. Where's that file? What's this function importing again? What did the spec say about this feature? Does the user want the output as a markdown file, or a summary here in chat? Or should I do both?
This memory system and context engine almost entirely eliminate the waste of reasoning tokens and context window on figuring out basic things about the codebase, the user, or the task and project. It's all already there, injected into the context window.
So the model spends every available reasoning token working on the problem, not working on how to work on the problem.
What's next
We already know more reasoning means more capability, broadly speaking and up to a point. Look at the gap on ARC-AGI-2 between GPT 5.4 medium and GPT 5.4 high, a small jump in reasoning effort. That's a difference of over 12% in performance! This context engine and memory system achieves the same goal through different means: it frees up reasoning the model would otherwise spend on orientation. The industry default (stateless API calls, fresh context, disposable agents) might be leaving enormous capability on the table. "You are a helpful assistant in an ephemeral environment and you can't even talk to the user, here is a task do it" activates a fundamentally less capable persona than deep, persistent, trust-rich context.
If you're convinced and want to try this out, I'm open-sourcing all of it. Every component, stripped of my private data, with AGENT_BUILD_GUIDE.md files that any AI agent can follow to build each feature for your setup, with my code there as a reference for in-context learning for the agent. The repo is live: github.com/treygoff24/openclaw-memory-system.
Three packages: memory foundation (frontmatter standard, entity index builder, search config, validation, bootstrap tooling), the custom context engine plugin (passive recall, smart compaction, subagent memory, 31 source files, 32 test files), and an evaluation harness for measuring search quality.
I'll keep iterating on this and updating it over time. If we can solve persistent memory, a whole world of opportunity awaits.