AI Agent Memory Architecture: Long-Term, Persistent Cognition for Production Agents
Most agents forget everything when the session ends. Enterprise agents managing ongoing relationships, projects, and workflows need persistent memory — episodic, semantic, and procedural — that survives across sessions and scales with usage.
Persistent agent memory requires three distinct layers: episodic memory (vector-indexed interaction history for semantic recall), semantic memory (structured facts and preferences in a relational store), and procedural memory (learned patterns preserved via few-shot cache or fine-tuning). LangMem, released by LangChain in 2025, provides a unified API over all three layers. Memory consolidation — periodically compressing episodic memories into semantic summaries — is essential for managing scale without linear context growth.
Why Stateless Agents Fail Enterprise Use Cases
The default architecture for an LLM agent is stateless: each conversation starts fresh, with no knowledge of prior interactions. This is fine for one-shot tasks — generate this report, classify this document, answer this question. It is catastrophically wrong for enterprise agents that manage ongoing work: a procurement agent that negotiates with the same supplier weekly, a customer success agent that manages a 24-month enterprise relationship, or a project management agent that tracks work across dozens of parallel workstreams.
A stateless agent re-learns everything from context on every session. The user must re-explain preferences, prior decisions, and background context each time. The agent cannot improve its performance based on what worked and what did not in previous interactions. It cannot notice trends, track unresolved issues, or maintain commitments across sessions.
The memory problem is not just a UX convenience — it is an architectural requirement for agents that are supposed to replace ongoing human cognitive work. A human account manager remembers what was discussed in the last call. They know the customer's preferences, the outstanding action items, the history of issues. A persistent agent must do the same.
The Three-Layer Memory Model
Cognitive science distinguishes three types of human memory. The same taxonomy maps cleanly onto agent memory architecture:
Episodic Memory: What Happened
Episodic memory stores individual interaction records — what was said, what was decided, what tools were called, and what outcomes resulted. In agent architecture, episodic memories are stored as embeddings in a vector database, enabling semantic recall: 'retrieve the last 5 interactions where this customer raised a billing concern.' Redis Stack or Qdrant with TTL-based expiry handles episodic storage effectively. The key design decision is granularity: store at the message level (high fidelity, high storage cost) or at the interaction summary level (lower fidelity, cheaper). For most enterprise agents, interaction-level summaries (250-500 tokens per session) strike the right balance.
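The granularity trade-off above can be sketched as a simple budget check run before an episodic record is written. The 4-characters-per-token heuristic and both helper names are illustrative assumptions, not part of any library API:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def within_summary_budget(summary: str, min_tokens: int = 250,
                          max_tokens: int = 500) -> bool:
    """Check that a session summary lands in the 250-500 token band
    before it is embedded and stored as an episodic record."""
    n = estimate_tokens(summary)
    return min_tokens <= n <= max_tokens
```

A summary that fails the check would be re-summarized (too long) or merged with adjacent turns (too short) before storage.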
Semantic Memory: What Is Known
Semantic memory stores stable facts, preferences, and knowledge about entities the agent interacts with. For a customer success agent: customer industry, product tier, stated preferences, known constraints, decision-maker names, and relationship history. Unlike episodic memory (which is time-ordered interaction records), semantic memory is a structured knowledge graph that is updated as new facts are learned and old facts are superseded. Postgres with JSONB columns is the standard implementation — structured enough for reliable retrieval, flexible enough to accommodate evolving schemas. LangMem provides a semantic memory abstraction over Postgres.
Procedural Memory: How to Do Things
Procedural memory encodes learned skills and effective patterns. In human cognition this is muscle memory and automated behaviour. In agent architecture, it manifests as: few-shot examples of successful interactions cached for retrieval, prompt templates that encode learned best practices, or — for high-value production agents with sufficient data — fine-tuned model weights that have internalised domain patterns. Few-shot cache (storing successful interaction patterns as retrievable examples) is the most practical implementation for most enterprise agents. Fine-tuning is reserved for high-volume, well-defined tasks where the cost justifies the investment.
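As a minimal sketch of the few-shot cache in use: cached patterns retrieved from the store are prepended to the task prompt as worked examples. The `{"input": ..., "output": ...}` dict shape is an assumption for illustration, not a requirement of Redis or LangMem:

```python
def build_few_shot_prompt(task_instruction: str, patterns: list[dict]) -> str:
    """Prepend top-scoring cached interaction patterns to the task
    instruction as few-shot examples."""
    blocks = []
    for i, p in enumerate(patterns, 1):
        blocks.append(f"Example {i}:\nInput: {p['input']}\nOutput: {p['output']}")
    examples = "\n\n".join(blocks)
    # Fall back to the bare instruction when the cache is empty
    return f"{examples}\n\n{task_instruction}" if examples else task_instruction
```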
Memory Consolidation and Forgetting Strategies
Without consolidation, episodic memory grows without bound and retrieval quality degrades as the signal-to-noise ratio drops. Two mechanisms manage this:
Memory Consolidation
Periodically compress episodic memories into semantic summaries. After every 10 sessions with a given entity, run a consolidation job: retrieve the last 10 session summaries, have the LLM extract persistent facts (new preferences stated, issues resolved, commitments made), update the semantic memory store with those facts, and archive the episodic records. LangMem's consolidation pipeline handles this automatically with configurable consolidation triggers.
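A hedged sketch of the trigger side: count sessions per entity and fire the consolidation job every N sessions. In production the counter would live in Redis (`INCR`); a plain dict stands in here, and the class name is ours:

```python
class ConsolidationTrigger:
    """Fires a consolidation job every N sessions per entity."""

    def __init__(self, every_n_sessions: int = 10):
        self.every_n = every_n_sessions
        self.counts: dict[str, int] = {}

    def record_session(self, entity_id: str) -> bool:
        """Increment the per-entity session counter; return True when a
        consolidation job should run for this entity."""
        self.counts[entity_id] = self.counts.get(entity_id, 0) + 1
        return self.counts[entity_id] % self.every_n == 0
```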
Forgetting: TTL and Relevance Decay
Not all memories should persist indefinitely. TTL-based expiry removes episodic records after a configurable retention period (90 days is standard for most enterprise use cases). Relevance decay de-prioritises memories that have been superseded — if the semantic memory records that a customer preferred monthly billing but a more recent interaction records they switched to annual, the older preference should decay in retrieval ranking, not remain equally weighted.
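Relevance decay can be sketched as an exponential age penalty applied to the raw similarity score at retrieval time, so older, likely-superseded memories sink in the ranking. The 30-day half-life is an illustrative choice, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def decayed_score(similarity: float, stored_at: datetime,
                  now: datetime, half_life_days: float = 30.0) -> float:
    """Exponentially decay a vector-similarity score by memory age:
    a memory loses half its retrieval weight every half_life_days."""
    age_days = (now - stored_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)
```

Applied to the billing example above, the monthly-billing preference stored months ago scores below the recent annual-billing record even at equal raw similarity.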
Three-Layer Agent Memory with LangMem, Redis, and Postgres
import asyncio
import json
import uuid
from datetime import datetime, timedelta, timezone
from typing import Optional

import redis.asyncio as redis
import asyncpg
from openai import AsyncOpenAI
from langmem import AsyncMemoryClient  # LangMem 0.1+ (LangChain, 2025)
from qdrant_client import AsyncQdrantClient
from qdrant_client.models import (
    PointStruct, Distance, VectorParams, Filter, FieldCondition, MatchValue
)


class ThreeLayerAgentMemory:
    """
    Unified three-layer memory system for persistent enterprise agents.
    - Episodic: Qdrant (interaction history with semantic search) + Redis TTL markers
    - Semantic: Postgres JSONB (facts, preferences, entity knowledge)
    - Procedural: Redis Sorted Set (few-shot example cache)
    """

    def __init__(
        self,
        redis_url: str,
        postgres_dsn: str,
        qdrant_host: str,
        langmem_api_key: str,
    ):
        self.redis_url = redis_url
        self.postgres_dsn = postgres_dsn
        self.qdrant = AsyncQdrantClient(host=qdrant_host, port=6333)
        self.openai = AsyncOpenAI()
        self.langmem = AsyncMemoryClient(api_key=langmem_api_key)
        self._redis: Optional[redis.Redis] = None
        self._pg: Optional[asyncpg.Connection] = None

    async def connect(self):
        self._redis = redis.from_url(self.redis_url, decode_responses=True)
        self._pg = await asyncpg.connect(self.postgres_dsn)
        await self._pg.execute("""
            CREATE TABLE IF NOT EXISTS agent_semantic_memory (
                entity_id TEXT NOT NULL,
                key TEXT NOT NULL,
                value JSONB NOT NULL,
                confidence FLOAT DEFAULT 1.0,
                updated_at TIMESTAMPTZ DEFAULT NOW(),
                PRIMARY KEY (entity_id, key)
            )
        """)
        await self._ensure_qdrant_collection()

    async def _ensure_qdrant_collection(self):
        collections = await self.qdrant.get_collections()
        names = [c.name for c in collections.collections]
        if "episodic_memory" not in names:
            await self.qdrant.create_collection(
                "episodic_memory",
                vectors_config=VectorParams(size=768, distance=Distance.COSINE)
            )

    async def _embed(self, text: str) -> list[float]:
        resp = await self.openai.embeddings.create(
            input=text, model="text-embedding-3-small", dimensions=768
        )
        return resp.data[0].embedding

    # --- EPISODIC MEMORY ---
    async def store_episode(self, entity_id: str, session_id: str,
                            summary: str, metadata: dict) -> None:
        """Store a session summary as an episodic memory."""
        vector = await self._embed(summary)
        point_id = str(uuid.uuid4())
        await self.qdrant.upsert(
            "episodic_memory",
            points=[PointStruct(
                id=point_id,
                vector=vector,
                payload={
                    "entity_id": entity_id,
                    "session_id": session_id,
                    "summary": summary,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    **metadata
                }
            )]
        )
        # 90-day expiry marker in Redis; a background job deletes the
        # corresponding Qdrant point once this key has expired.
        await self._redis.setex(
            f"ep_ttl:{point_id}", int(timedelta(days=90).total_seconds()), "1"
        )

    async def recall_episodes(
        self, entity_id: str, query: str, top_k: int = 5
    ) -> list[dict]:
        """Semantically retrieve relevant past episodes."""
        vector = await self._embed(query)
        results = await self.qdrant.search(
            "episodic_memory",
            query_vector=vector,
            query_filter=Filter(must=[
                FieldCondition(key="entity_id", match=MatchValue(value=entity_id))
            ]),
            limit=top_k,
            with_payload=True,
        )
        return [{"summary": r.payload["summary"],
                 "session_id": r.payload["session_id"],
                 "timestamp": r.payload["timestamp"],
                 "score": r.score} for r in results]

    # --- SEMANTIC MEMORY ---
    async def store_fact(self, entity_id: str, key: str,
                         value: dict, confidence: float = 1.0) -> None:
        """Store or update a structured fact about an entity."""
        await self._pg.execute("""
            INSERT INTO agent_semantic_memory (entity_id, key, value, confidence, updated_at)
            VALUES ($1, $2, $3::jsonb, $4, NOW())
            ON CONFLICT (entity_id, key)
            DO UPDATE SET value=$3::jsonb, confidence=$4, updated_at=NOW()
        """, entity_id, key, json.dumps(value), confidence)

    async def get_facts(self, entity_id: str) -> dict:
        """Retrieve all semantic facts for an entity."""
        rows = await self._pg.fetch(
            "SELECT key, value, confidence FROM agent_semantic_memory WHERE entity_id = $1",
            entity_id
        )
        return {row["key"]: {"value": json.loads(row["value"]),
                             "confidence": row["confidence"]} for row in rows}

    # --- PROCEDURAL MEMORY ---
    async def store_successful_pattern(
        self, pattern_key: str, example: dict, score: float
    ) -> None:
        """Cache a successful interaction pattern for few-shot retrieval."""
        await self._redis.zadd(
            f"procedural:{pattern_key}",
            {json.dumps(example): score}
        )
        # Keep only the 10 highest-scoring patterns per key
        await self._redis.zremrangebyrank(f"procedural:{pattern_key}", 0, -11)

    async def retrieve_patterns(self, pattern_key: str, top_k: int = 3) -> list[dict]:
        """Retrieve top-scoring procedural patterns for few-shot prompting."""
        raw = await self._redis.zrevrange(
            f"procedural:{pattern_key}", 0, top_k - 1
        )
        return [json.loads(item) for item in raw]

    # --- MEMORY CONSOLIDATION ---
    async def consolidate_memory(self, entity_id: str) -> None:
        """Extract persistent facts from recent episodes and update semantic memory."""
        # Similarity search is a proxy for recency here; a production version
        # would filter on the timestamp payload field instead.
        recent = await self.recall_episodes(entity_id, "recent interactions", top_k=10)
        if len(recent) < 5:
            return  # Not enough episodes to consolidate
        episodes_text = "\n\n".join(
            f"[{ep['timestamp']}]\n{ep['summary']}" for ep in recent
        )
        resp = await self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Extract persistent facts from these interaction summaries.
Return JSON: {{"facts": [{{"key": "str", "value": {{"summary": "str"}}, "confidence": 0.0-1.0}}]}}

{episodes_text}"""
            }],
            temperature=0, response_format={"type": "json_object"}
        )
        extracted = json.loads(resp.choices[0].message.content)
        for fact in extracted.get("facts", []):
            await self.store_fact(entity_id, fact["key"],
                                  fact["value"], fact["confidence"])


# Usage example
async def main():
    memory = ThreeLayerAgentMemory(
        redis_url="redis://localhost:6379",
        postgres_dsn="postgresql://user:pass@localhost/agentdb",
        qdrant_host="localhost",
        langmem_api_key="lm-..."
    )
    await memory.connect()
    await memory.store_episode(
        entity_id="customer_acme",
        session_id="sess_2025101401",
        summary="Customer raised concern about Q4 pricing. Agreed to review in January.",
        metadata={"sentiment": "neutral", "topics": ["pricing", "renewal"]}
    )
    episodes = await memory.recall_episodes(
        "customer_acme", "pricing concerns", top_k=3
    )
    print(episodes)


if __name__ == "__main__":
    asyncio.run(main())

Three-layer agent memory implementation using Qdrant for episodic vector storage, Postgres JSONB for semantic facts, and Redis Sorted Sets for the procedural few-shot cache. The consolidate_memory method runs as a background job after every 10 sessions to extract persistent facts from episode history and promote them to semantic memory.
Memory systems introduce GDPR and data retention compliance requirements that are easy to overlook during development. Every episodic memory record containing user interaction data is potentially a personal data record under GDPR Article 4. Before deploying persistent agent memory in production, confirm: (1) your data retention policy defines maximum TTLs for episodic records, (2) you have a delete-by-entity-ID endpoint that purges all three memory layers on data subject erasure requests, and (3) semantic memory facts derived from personal data are classified accordingly.
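The erasure endpoint from the checklist can be sketched as an orchestrator that runs one purge function per memory layer and refuses to report success unless all three layers are covered. The purge callables are assumptions: in a real deployment they would wrap a Qdrant delete-by-filter, a Postgres DELETE, and a Redis key scan:

```python
from typing import Callable

REQUIRED_LAYERS = {"episodic", "semantic", "procedural"}

def erase_entity(entity_id: str,
                 purgers: dict[str, Callable[[str], int]]) -> dict[str, int]:
    """Run every layer's purge function for a data-subject erasure request.
    Each purger deletes all records for entity_id in one memory layer and
    returns the number of records removed."""
    missing = REQUIRED_LAYERS - purgers.keys()
    if missing:
        # Partial erasure is a compliance failure, not a partial success
        raise ValueError(f"erasure must cover all memory layers; missing: {missing}")
    return {layer: purge(entity_id) for layer, purge in purgers.items()}
```

Keeping the layer list explicit means a newly added fourth store cannot be silently skipped by the erasure path.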
Memory Architecture Design Principles
- Separate episodic and semantic storage from the start — retrofitting them onto a single store later is more painful than a slightly more complex initial design.
- Design memory consolidation before you design memory storage. Consolidation is the mechanism that keeps the system scalable; without it, episodic memory becomes a liability.
- Implement a memory retrieval test harness before connecting memory to an agent. Verify that the retrieval system surfaces relevant memories and not noisy ones.
- Version your semantic memory schema. Facts change, and you need to be able to migrate existing records when the schema evolves.
- Monitor memory hit rate in production — the percentage of agent sessions where memory retrieval surfaced at least one relevant record. A hit rate below 40% indicates retrieval quality issues, not a storage problem.
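The hit-rate metric from the last principle reduces to a small function over per-session retrieval scores. The 0.7 relevance threshold is an illustrative assumption; the 40% floor comes from the text:

```python
def memory_hit_rate(session_top_scores: list[float],
                    threshold: float = 0.7) -> float:
    """Fraction of sessions where retrieval surfaced at least one relevant
    record. session_top_scores holds the best retrieval score per session
    (0.0 when nothing was retrieved at all)."""
    if not session_top_scores:
        return 0.0
    hits = sum(1 for s in session_top_scores if s >= threshold)
    return hits / len(session_top_scores)
```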
Inductivee's Memory Architecture in Production
The first time we built a three-layer memory system at scale, the hardest problem was not the storage or retrieval — it was defining what constitutes a 'fact' worth promoting from episodic to semantic memory. Episodic memories are cheap to store and retrieve. The consolidation step — deciding which facts are stable enough to promote — requires either careful prompt engineering or domain-specific rules.
For customer-facing enterprise agents, we have settled on a hierarchy of fact stability: stated preferences (high stability, promote immediately), inferred preferences (medium stability, promote after 3+ confirming episodes), and situational context (low stability, keep episodic, do not promote). This taxonomy significantly reduces false promotions, where a one-off comment becomes a permanent fact.
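The stability hierarchy above can be encoded as a small promotion policy. The category names and the 3-episode threshold come from the text; the function shape is our own sketch:

```python
def should_promote(fact_type: str, confirming_episodes: int = 1) -> bool:
    """Decide whether a candidate fact is stable enough to promote from
    episodic to semantic memory."""
    if fact_type == "stated_preference":
        return True  # high stability: promote immediately
    if fact_type == "inferred_preference":
        return confirming_episodes >= 3  # medium stability: needs confirmation
    return False  # situational context stays episodic
```

Running this gate inside the consolidation job is what keeps one-off comments out of the semantic store.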
The LangMem library (released by the LangChain team in 2025) provides the cleanest unified API we have found for managing all three layers. For teams starting a new implementation, begin with LangMem's managed API before self-hosting. The self-hosted path requires running Postgres, Redis, and a vector database, and managing the consolidation pipeline — substantial infrastructure overhead for a team still validating the product.
Frequently Asked Questions
How does persistent memory work in AI agents?
What is the difference between episodic and semantic memory in AI agents?
What is LangMem and what does it do?
How do you handle GDPR compliance for agent memory systems?
How much storage does a persistent agent memory system require?
Written By
Inductivee Team
Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.
Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
Engineer This With Inductivee
The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.
Agentic Custom Software Engineering
We engineer autonomous agentic systems that orchestrate enterprise workflows and unlock the hidden liquidity of your proprietary data.
Autonomous Agentic SaaS
Agentic SaaS development and autonomous platform engineering — we build SaaS products whose core loop is powered by LangGraph and CrewAI agents that execute workflows, not just manage them.
Related Articles
Context Window Management for Long-Running Agents: Engineering Patterns
RAG Pipeline Architecture for the Enterprise: Five Layers Beyond the Basic Chatbot
Agent Design Patterns: ReAct, Reflexion, Plan-and-Execute, and Supervisor-Worker
Ready to Build This Into Your Enterprise?
Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.
Start a Project