Agent Memory Architecture: Working, Episodic, Semantic

CallMissed · May 8, 2026 · 5 min read · Guide

AI Agents Architecture Memory Engineering

"Agent memory" is one of the most overloaded terms in the field. People mean radically different things: a chat-history buffer, a vector store of past sessions, a fact graph, or some custom hybrid. This matters because picking the wrong memory shape for the job is the most common reason agents that demo well don't ship.

The cleanest mental model borrows from cognitive science. The field has largely settled on four memory types that map fairly directly onto what your stack needs.

The four memory types

Working memory is the current conversation context — the messages, system prompt, and anything you've explicitly stuffed into the LLM call. It's bounded by the model's context window and is the cheapest, fastest memory layer. Most "memory bugs" people report are actually working-memory truncation problems.

Procedural memory is the system prompt and decision logic that defines how the agent behaves. It's typically static or version-controlled, not learned at runtime. Treat it as code, not data.

Semantic memory holds general factual knowledge: user preferences ("Anuj prefers Python"), domain entities ("the project uses PostgreSQL 16"), and stable relationships. It changes slowly and is best stored in a structured store (Postgres, Neo4j) with explicit schemas.

Episodic memory is timestamped records of past interactions — "the user reported the bug on Tuesday and we tried fix X." Vector databases are the standard backing store, retrieved by semantic or hybrid search.
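The working-memory truncation mentioned above is easier to debug when it is an explicit step rather than something the API does silently. A minimal sketch, assuming a rough four-characters-per-token estimate and hypothetical names (Message, estimate_tokens, fit_to_budget) that are not part of any library:

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; swap in a real tokenizer.
    return max(1, len(text) // 4)


def fit_to_budget(messages: list[Message], budget: int) -> list[Message]:
    """Keep every system message, then the newest turns that still fit."""
    system = [m for m in messages if m.role == "system"]
    turns = [m for m in messages if m.role != "system"]
    used = sum(estimate_tokens(m.content) for m in system)
    kept: list[Message] = []
    for m in reversed(turns):              # walk newest to oldest
        cost = estimate_tokens(m.content)
        if used + cost > budget:
            break                          # older turns fall out of working memory
        kept.append(m)
        used += cost
    return system + list(reversed(kept))   # restore chronological order
```

In a real agent the dropped turns would typically be summarized or handed to the episodic store instead of being discarded.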

The most-cited live example is MemGPT / Letta, which models the agent's memory as an OS: in-context "core memory" (RAM-like), "recall" memory (the conversation database), and "archival" memory (long-term searchable storage). The agent uses tool calls to read, write, and migrate between layers — what Letta calls "agentic memory control."

Why the distinction matters

The frequent mistake is to dump everything into one vector store. Three failure modes follow:

Stale facts overwrite fresh facts. If "the user prefers Python" sits next to "the user said they want to learn Rust," semantic search returns whatever scored higher, not whatever is currently true.

Context bloat. Returning the top-k episodic chunks every turn drowns out the system prompt and degrades reasoning.

No write policy. Without a rule for what gets stored, the memory grows linearly with every turn, retrieval gets noisier, and latency climbs.

Splitting episodic from semantic forces you to answer the right question for each store. Episodic answers "what happened, and when?"; semantic answers "what is currently true?"
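One way to picture the split: the semantic store keeps a single current value per key, so a newer fact replaces the older one, while the episodic store is append-only. The sketch below is an in-memory illustration only; MemoryStores, its method names, and the "project.database" key are hypothetical, and a real system would back them with Postgres and a vector index.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Episode:
    at: datetime
    summary: str


@dataclass
class MemoryStores:
    # Semantic: one current value per key, so new facts replace old ones.
    semantic: dict[str, str] = field(default_factory=dict)
    # Episodic: append-only log, so history is never rewritten.
    episodic: list[Episode] = field(default_factory=list)

    def set_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def log_episode(self, summary: str, at: Optional[datetime] = None) -> None:
        self.episodic.append(Episode(at or datetime.now(timezone.utc), summary))

    def current(self, key: str) -> Optional[str]:
        """Answers: what is currently true?"""
        return self.semantic.get(key)

    def history(self, since: datetime) -> list[Episode]:
        """Answers: what happened, and when?"""
        return [e for e in self.episodic if e.at >= since]
```

With this shape, a stale fact cannot outrank a fresh one at read time, because the stale value no longer exists in the semantic store; the record of the change still lives in the episodic log.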

Write policies

The write step is the part most teams underbuild. Common policies:

Always-write episodic. Every turn, persist a summary embedding. Cheap, but noisy.

Reflection-based. After a session, run a small "reflect" call that extracts durable facts and either inserts new semantic memories or updates existing ones. This matches the MEMORY rules many agent harnesses already use ("store facts about preferences, decisions, recurring failures, strategies that worked").

Importance-weighted. Tag every memory with an importance score (0–1) and a category that controls decay. Low-importance memories get pruned faster. Letta and several DIY harnesses use variants of this.
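The importance-weighted policy can be sketched as a retention score: importance multiplied by an exponential decay whose half-life depends on the category. The category names, half-lives, and the 0.1 threshold below are illustrative assumptions, not values taken from Letta or any other harness.

```python
import math
from dataclasses import dataclass

# Assumption: per-category half-lives in days; tune these for your agent.
HALF_LIFE_DAYS = {"preference": 365.0, "decision": 90.0, "chatter": 7.0}
DEFAULT_HALF_LIFE_DAYS = 30.0


@dataclass
class Memory:
    text: str
    importance: float   # 0.0 to 1.0
    category: str
    age_days: float


def retention(m: Memory) -> float:
    """Importance scaled by exponential decay with a per-category half-life."""
    half_life = HALF_LIFE_DAYS.get(m.category, DEFAULT_HALF_LIFE_DAYS)
    return m.importance * math.pow(0.5, m.age_days / half_life)


def prune(memories: list[Memory], threshold: float = 0.1) -> list[Memory]:
    """Keep only memories whose retention score is still above the threshold."""
    return [m for m in memories if retention(m) >= threshold]
```

Because the score starts at the importance value, a low-importance memory crosses the threshold sooner than a high-importance one in the same category, which is the "pruned faster" behavior described above.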

A mature memory layer separates store (write a candidate memory), update (merge into an existing one), and ignore (filler / questions) as distinct decisions, not a single "save everything" path.
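The three-way decision can be made explicit in code. This sketch assumes an upstream reflection step has already produced either a candidate fact or nothing (for filler and questions); Candidate, WriteAction, decide, and apply_write are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WriteAction(Enum):
    STORE = "store"
    UPDATE = "update"
    IGNORE = "ignore"


@dataclass
class Candidate:
    key: str
    value: str


def decide(candidate: Optional[Candidate], facts: dict[str, str]) -> WriteAction:
    if candidate is None:
        return WriteAction.IGNORE          # filler or a question: nothing durable
    if candidate.key not in facts:
        return WriteAction.STORE           # genuinely new fact
    if facts[candidate.key] == candidate.value:
        return WriteAction.IGNORE          # already known, avoid a duplicate write
    return WriteAction.UPDATE              # same key, changed value


def apply_write(candidate: Optional[Candidate], facts: dict[str, str]) -> WriteAction:
    action = decide(candidate, facts)
    if candidate is not None and action is not WriteAction.IGNORE:
        facts[candidate.key] = candidate.value
    return action
```

Returning the action rather than writing silently makes it straightforward to log how often each branch fires, which is a quick check on whether the memory is growing for good reasons.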

Read policies

On the read side, the rule is "retrieve once, retrieve narrowly." The common default of dropping the top-10 episodic chunks into context on every turn does more harm than good once a conversation is no longer short. Better patterns:

Recall on intent change. Detect when the user pivots topics; retrieve relevant memories then, not on every turn.

Tool-gated retrieval. Expose recall_memory(query) as an explicit tool. The model decides when to call it. This is the Claude Code / Letta pattern.

Recency-blended scoring. Pure cosine similarity ignores time. Mix in a recency decay so a fact from yesterday outranks a near-duplicate from a year ago.
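Recency-blended scoring can be as simple as a weighted sum of cosine similarity and an exponential recency term. The 30-day half-life and 0.3 recency weight below are placeholder assumptions; the right values depend on how quickly facts go stale in your domain.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def blended_score(
    query_vec: list[float],
    memory_vec: list[float],
    age_days: float,
    half_life_days: float = 30.0,   # assumption: recency halves every 30 days
    recency_weight: float = 0.3,    # assumption: 30% recency, 70% similarity
) -> float:
    similarity = cosine(query_vec, memory_vec)
    recency = 0.5 ** (age_days / half_life_days)   # 1.0 when brand new, tends to 0
    return (1.0 - recency_weight) * similarity + recency_weight * recency
```

With identical embeddings, a memory from yesterday scores higher than one from a year ago, so near-duplicates are ordered by freshness instead of by noise in the similarity score.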

Conflations to avoid

A few traps:

Memory is not a long context window. Even with 1M-token contexts, stuffing the entire history every turn is expensive and degrades attention. Memory implies selective recall.

Vector search is not a database. [Speculation] Many production agents would be better served by Postgres + a good extraction prompt than by a pure vector store. Use vectors for fuzzy semantic recall; use structured tables for known facts.

Memory is not a knowledge base. A KB is curated content the agent retrieves to answer questions. Memory is the agent's own state, accumulated across sessions. Same retrieval primitives, different write policies.

A reasonable starting stack

For most production agents in 2026, this stack is enough:

Working memory: the current message thread, summarized when it nears the context budget

Semantic memory: Postgres with two tables — user_preferences and entities — written by a reflection step at end-of-session

Episodic memory: pgvector embeddings of session summaries, with importance and recency in the scoring function

Read interface: a single recall(query) tool the agent calls when it needs prior context

You can layer Letta on top later if you outgrow this stack; you almost certainly won't need to.
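The read interface from the list above might look like the following. This is a sketch under stated assumptions: the tool description uses a generic JSON-Schema-style shape that you would adapt to your model provider's tool format, the MEMORIES list stands in for the real store, and the word-overlap ranking is a placeholder for a pgvector similarity query.

```python
import json

# Assumption: generic JSON-Schema-style tool description, not a specific provider's format.
RECALL_TOOL = {
    "name": "recall",
    "description": "Search long-term memory for context relevant to the query.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Hypothetical in-memory stand-in for stored session summaries.
MEMORIES = [
    "User prefers Python for scripting",
    "Project uses PostgreSQL 16",
    "Tried fix X for the login bug on Tuesday",
]


def recall(query: str, limit: int = 3) -> list[str]:
    """Placeholder retrieval: rank memories by number of shared words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(m.lower().split())), m) for m in MEMORIES]
    hits = [(score, m) for score, m in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in hits[:limit]]


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch a tool call emitted by the model and return a JSON result."""
    if name != "recall":
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    return json.dumps(recall(args["query"]))
```

Keeping retrieval behind a single tool means nothing is injected into context unless the model asks for it, which is the "retrieve narrowly" rule applied at the interface level.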

Frequently Asked Questions

Do I need a vector database for agent memory?

Not always. For semantic memory (stable facts, preferences), structured Postgres is often a better fit. Vectors earn their keep for episodic recall over many sessions where exact-match queries don't work.

How is agent memory different from RAG?

RAG retrieves curated knowledge to answer questions. Agent memory persists the agent's own observations and decisions across sessions. The retrieval primitives are similar; the write policy and ownership differ.

How do I stop memory from filling with junk?

Add a reflection step at end-of-session that picks what's worth keeping, tag every memory with importance and category, and decay or prune low-importance items. Writing every turn unconditionally is the most common cause of noisy retrieval.