AI Agent Memory Architecture: Long-Term, Persistent Cognition for Production Agents
Most agents forget everything when the session ends. Enterprise agents managing ongoing relationships, projects, and workflows need persistent memory — episodic, semantic, and procedural — that survives across sessions and scales with usage.
Persistent agent memory requires three distinct layers: episodic memory (vector-indexed interaction history for semantic recall), semantic memory (structured facts and preferences in a relational store), and procedural memory (learned patterns preserved via few-shot cache or fine-tuning). LangMem, released by LangChain in 2025, provides a unified API over all three layers. Memory consolidation — periodically compressing episodic memories into semantic summaries — is essential for managing scale without linear context growth.
Why Stateless Agents Fail Enterprise Use Cases
The default architecture for an LLM agent is stateless: each conversation starts fresh, with no knowledge of prior interactions. This is fine for one-shot tasks — generate this report, classify this document, answer this question. It is catastrophically wrong for enterprise agents that manage ongoing work: a procurement agent that negotiates with the same supplier weekly, a customer success agent that manages a 24-month enterprise relationship, or a project management agent that tracks work across dozens of parallel workstreams.
A stateless agent re-learns everything from context on every session. The user must re-explain preferences, prior decisions, and background context each time. The agent cannot improve its performance based on what worked and what did not in previous interactions. It cannot notice trends, track unresolved issues, or maintain commitments across sessions.
The memory problem is not just a UX convenience — it is an architectural requirement for agents that are supposed to replace ongoing human cognitive work. A human account manager remembers what was discussed in the last call. They know the customer's preferences, the outstanding action items, the history of issues. A persistent agent must do the same.
The Three-Layer Memory Model
Cognitive science distinguishes three types of human memory. The same taxonomy maps cleanly onto agent memory architecture:
Episodic Memory: What Happened
Episodic memory stores individual interaction records — what was said, what was decided, what tools were called, and what outcomes resulted. In agent architecture, episodic memories are stored as embeddings in a vector database, enabling semantic recall: 'retrieve the last 5 interactions where this customer raised a billing concern.' Redis Stack or Qdrant with TTL-based expiry handles episodic storage effectively. The key design decision is granularity: store at the message level (high fidelity, high storage cost) or at the interaction summary level (lower fidelity, cheaper). For most enterprise agents, interaction-level summaries (250-500 tokens per session) strike the right balance.
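The granularity trade-off above can be sketched as a simple budget check run before an episodic record is written. The 4-characters-per-token heuristic and both helper names are illustrative assumptions, not part of any library API:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def within_summary_budget(summary: str, min_tokens: int = 250,
                          max_tokens: int = 500) -> bool:
    """Check that a session summary lands in the 250-500 token band
    before it is embedded and stored as an episodic record."""
    n = estimate_tokens(summary)
    return min_tokens <= n <= max_tokens
```

A summary that fails the check would be re-summarized (too long) or merged with adjacent turns (too short) before storage.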
Semantic Memory: What Is Known
Semantic memory stores stable facts, preferences, and knowledge about entities the agent interacts with. For a customer success agent: customer industry, product tier, stated preferences, known constraints, decision-maker names, and relationship history. Unlike episodic memory (which is time-ordered interaction records), semantic memory is a structured knowledge graph that is updated as new facts are learned and old facts are superseded. Postgres with JSONB columns is the standard implementation — structured enough for reliable retrieval, flexible enough to accommodate evolving schemas. LangMem provides a semantic memory abstraction over Postgres.
Procedural Memory: How to Do Things
Procedural memory encodes learned skills and effective patterns. In human cognition this is muscle memory and automated behaviour. In agent architecture, it manifests as: few-shot examples of successful interactions cached for retrieval, prompt templates that encode learned best practices, or — for high-value production agents with sufficient data — fine-tuned model weights that have internalised domain patterns. Few-shot cache (storing successful interaction patterns as retrievable examples) is the most practical implementation for most enterprise agents. Fine-tuning is reserved for high-volume, well-defined tasks where the cost justifies the investment.
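As a minimal sketch of the few-shot cache in use: cached patterns retrieved from the store are prepended to the task prompt as worked examples. The `{"input": ..., "output": ...}` dict shape is an assumption for illustration, not a requirement of Redis or LangMem:

```python
def build_few_shot_prompt(task_instruction: str, patterns: list[dict]) -> str:
    """Prepend top-scoring cached interaction patterns to the task
    instruction as few-shot examples."""
    blocks = []
    for i, p in enumerate(patterns, 1):
        blocks.append(f"Example {i}:\nInput: {p['input']}\nOutput: {p['output']}")
    examples = "\n\n".join(blocks)
    # Fall back to the bare instruction when the cache is empty
    return f"{examples}\n\n{task_instruction}" if examples else task_instruction
```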
Memory Consolidation and Forgetting Strategies
Without consolidation, episodic memory grows without bound and retrieval quality degrades as the signal-to-noise ratio drops. Two mechanisms manage this:
Memory Consolidation
Periodically compress episodic memories into semantic summaries. After every 10 sessions with a given entity, run a consolidation job: retrieve the last 10 session summaries, have the LLM extract persistent facts (new preferences stated, issues resolved, commitments made), update the semantic memory store with those facts, and archive the episodic records. LangMem's consolidation pipeline handles this automatically with configurable consolidation triggers.
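A hedged sketch of the trigger side: count sessions per entity and fire the consolidation job every N sessions. In production the counter would live in Redis (`INCR`); a plain dict stands in here, and the class name is ours:

```python
class ConsolidationTrigger:
    """Fires a consolidation job every N sessions per entity."""

    def __init__(self, every_n_sessions: int = 10):
        self.every_n = every_n_sessions
        self.counts: dict[str, int] = {}

    def record_session(self, entity_id: str) -> bool:
        """Increment the per-entity session counter; return True when a
        consolidation job should run for this entity."""
        self.counts[entity_id] = self.counts.get(entity_id, 0) + 1
        return self.counts[entity_id] % self.every_n == 0
```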
Forgetting: TTL and Relevance Decay
Not all memories should persist indefinitely. TTL-based expiry removes episodic records after a configurable retention period (90 days is standard for most enterprise use cases). Relevance decay de-prioritises memories that have been superseded — if the semantic memory records that a customer preferred monthly billing but a more recent interaction records they switched to annual, the older preference should decay in retrieval ranking, not remain equally weighted.
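Relevance decay can be sketched as an exponential age penalty applied to the raw similarity score at retrieval time, so older, likely-superseded memories sink in the ranking. The 30-day half-life is an illustrative choice, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def decayed_score(similarity: float, stored_at: datetime,
                  now: datetime, half_life_days: float = 30.0) -> float:
    """Exponentially decay a vector-similarity score by memory age:
    a memory loses half its retrieval weight every half_life_days."""
    age_days = (now - stored_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)
```

Applied to the billing example above, the monthly-billing preference stored months ago scores below the recent annual-billing record even at equal raw similarity.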
Three-Layer Agent Memory with LangMem, Redis, and Postgres
import asyncio
import json
import uuid
from datetime import datetime, timedelta, timezone
from typing import Optional

import redis.asyncio as redis
import asyncpg
from openai import AsyncOpenAI
from langmem import AsyncMemoryClient  # LangMem 0.1+ (LangChain, 2025)
from qdrant_client import AsyncQdrantClient
from qdrant_client.models import (
    PointStruct, Distance, VectorParams, Filter, FieldCondition, MatchValue
)


class ThreeLayerAgentMemory:
    """
    Unified three-layer memory system for persistent enterprise agents.
    - Episodic: Qdrant (interaction history with semantic search) + Redis TTL markers
    - Semantic: Postgres JSONB (facts, preferences, entity knowledge)
    - Procedural: Redis Sorted Set (few-shot example cache)
    """

    def __init__(
        self,
        redis_url: str,
        postgres_dsn: str,
        qdrant_host: str,
        langmem_api_key: str,
    ):
        self.redis_url = redis_url
        self.postgres_dsn = postgres_dsn
        self.qdrant = AsyncQdrantClient(host=qdrant_host, port=6333)
        self.openai = AsyncOpenAI()
        self.langmem = AsyncMemoryClient(api_key=langmem_api_key)
        self._redis: Optional[redis.Redis] = None
        self._pg: Optional[asyncpg.Connection] = None

    async def connect(self):
        self._redis = redis.from_url(self.redis_url, decode_responses=True)
        self._pg = await asyncpg.connect(self.postgres_dsn)
        await self._pg.execute("""
            CREATE TABLE IF NOT EXISTS agent_semantic_memory (
                entity_id TEXT NOT NULL,
                key TEXT NOT NULL,
                value JSONB NOT NULL,
                confidence FLOAT DEFAULT 1.0,
                updated_at TIMESTAMPTZ DEFAULT NOW(),
                PRIMARY KEY (entity_id, key)
            )
        """)
        await self._ensure_qdrant_collection()

    async def _ensure_qdrant_collection(self):
        collections = await self.qdrant.get_collections()
        names = [c.name for c in collections.collections]
        if "episodic_memory" not in names:
            await self.qdrant.create_collection(
                "episodic_memory",
                vectors_config=VectorParams(size=768, distance=Distance.COSINE)
            )

    async def _embed(self, text: str) -> list[float]:
        resp = await self.openai.embeddings.create(
            input=text, model="text-embedding-3-small", dimensions=768
        )
        return resp.data[0].embedding

    # --- EPISODIC MEMORY ---
    async def store_episode(self, entity_id: str, session_id: str,
                            summary: str, metadata: dict) -> None:
        """Store a session summary as an episodic memory."""
        vector = await self._embed(summary)
        point_id = str(uuid.uuid4())
        await self.qdrant.upsert(
            "episodic_memory",
            points=[PointStruct(
                id=point_id,
                vector=vector,
                payload={
                    "entity_id": entity_id,
                    "session_id": session_id,
                    "summary": summary,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    **metadata
                }
            )]
        )
        # 90-day expiry marker in Redis; a background job deletes the
        # corresponding Qdrant point once this key has expired.
        await self._redis.setex(
            f"ep_ttl:{point_id}", int(timedelta(days=90).total_seconds()), "1"
        )

    async def recall_episodes(
        self, entity_id: str, query: str, top_k: int = 5
    ) -> list[dict]:
        """Semantically retrieve relevant past episodes."""
        vector = await self._embed(query)
        results = await self.qdrant.search(
            "episodic_memory",
            query_vector=vector,
            query_filter=Filter(must=[
                FieldCondition(key="entity_id", match=MatchValue(value=entity_id))
            ]),
            limit=top_k,
            with_payload=True,
        )
        return [{"summary": r.payload["summary"],
                 "session_id": r.payload["session_id"],
                 "timestamp": r.payload["timestamp"],
                 "score": r.score} for r in results]

    # --- SEMANTIC MEMORY ---
    async def store_fact(self, entity_id: str, key: str,
                         value: dict, confidence: float = 1.0) -> None:
        """Store or update a structured fact about an entity."""
        await self._pg.execute("""
            INSERT INTO agent_semantic_memory (entity_id, key, value, confidence, updated_at)
            VALUES ($1, $2, $3::jsonb, $4, NOW())
            ON CONFLICT (entity_id, key)
            DO UPDATE SET value=$3::jsonb, confidence=$4, updated_at=NOW()
        """, entity_id, key, json.dumps(value), confidence)

    async def get_facts(self, entity_id: str) -> dict:
        """Retrieve all semantic facts for an entity."""
        rows = await self._pg.fetch(
            "SELECT key, value, confidence FROM agent_semantic_memory WHERE entity_id = $1",
            entity_id
        )
        return {row["key"]: {"value": json.loads(row["value"]),
                             "confidence": row["confidence"]} for row in rows}

    # --- PROCEDURAL MEMORY ---
    async def store_successful_pattern(
        self, pattern_key: str, example: dict, score: float
    ) -> None:
        """Cache a successful interaction pattern for few-shot retrieval."""
        await self._redis.zadd(
            f"procedural:{pattern_key}",
            {json.dumps(example): score}
        )
        # Keep only the 10 highest-scoring patterns per key
        await self._redis.zremrangebyrank(f"procedural:{pattern_key}", 0, -11)

    async def retrieve_patterns(self, pattern_key: str, top_k: int = 3) -> list[dict]:
        """Retrieve top-scoring procedural patterns for few-shot prompting."""
        raw = await self._redis.zrevrange(
            f"procedural:{pattern_key}", 0, top_k - 1
        )
        return [json.loads(item) for item in raw]

    # --- MEMORY CONSOLIDATION ---
    async def consolidate_memory(self, entity_id: str) -> None:
        """Extract persistent facts from recent episodes and update semantic memory."""
        # Similarity search is a proxy for recency here; a production version
        # would filter on the timestamp payload field instead.
        recent = await self.recall_episodes(entity_id, "recent interactions", top_k=10)
        if len(recent) < 5:
            return  # Not enough episodes to consolidate
        episodes_text = "\n\n".join(
            f"[{ep['timestamp']}]\n{ep['summary']}" for ep in recent
        )
        resp = await self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Extract persistent facts from these interaction summaries.
Return JSON: {{"facts": [{{"key": "str", "value": {{"summary": "str"}}, "confidence": 0.0-1.0}}]}}

{episodes_text}"""
            }],
            temperature=0, response_format={"type": "json_object"}
        )
        extracted = json.loads(resp.choices[0].message.content)
        for fact in extracted.get("facts", []):
            await self.store_fact(entity_id, fact["key"],
                                  fact["value"], fact["confidence"])


# Usage example
async def main():
    memory = ThreeLayerAgentMemory(
        redis_url="redis://localhost:6379",
        postgres_dsn="postgresql://user:pass@localhost/agentdb",
        qdrant_host="localhost",
        langmem_api_key="lm-..."
    )
    await memory.connect()
    await memory.store_episode(
        entity_id="customer_acme",
        session_id="sess_2025101401",
        summary="Customer raised concern about Q4 pricing. Agreed to review in January.",
        metadata={"sentiment": "neutral", "topics": ["pricing", "renewal"]}
    )
    episodes = await memory.recall_episodes(
        "customer_acme", "pricing concerns", top_k=3
    )
    print(episodes)


if __name__ == "__main__":
    asyncio.run(main())

Three-layer agent memory implementation using Qdrant for episodic vector storage, Postgres JSONB for semantic facts, and Redis Sorted Sets for the procedural few-shot cache. The consolidate_memory method runs as a background job after every 10 sessions to extract persistent facts from episode history and promote them to semantic memory.
Memory systems introduce GDPR and data retention compliance requirements that are easy to overlook during development. Every episodic memory record containing user interaction data is potentially a personal data record under GDPR Article 4. Before deploying persistent agent memory in production, confirm: (1) your data retention policy defines maximum TTLs for episodic records, (2) you have a delete-by-entity-ID endpoint that purges all three memory layers on data subject erasure requests, and (3) semantic memory facts derived from personal data are classified accordingly.
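The erasure endpoint from the checklist can be sketched as an orchestrator that runs one purge function per memory layer and refuses to report success unless all three layers are covered. The purge callables are assumptions: in a real deployment they would wrap a Qdrant delete-by-filter, a Postgres DELETE, and a Redis key scan:

```python
from typing import Callable

REQUIRED_LAYERS = {"episodic", "semantic", "procedural"}

def erase_entity(entity_id: str,
                 purgers: dict[str, Callable[[str], int]]) -> dict[str, int]:
    """Run every layer's purge function for a data-subject erasure request.
    Each purger deletes all records for entity_id in one memory layer and
    returns the number of records removed."""
    missing = REQUIRED_LAYERS - purgers.keys()
    if missing:
        # Partial erasure is a compliance failure, not a partial success
        raise ValueError(f"erasure must cover all memory layers; missing: {missing}")
    return {layer: purge(entity_id) for layer, purge in purgers.items()}
```

Keeping the layer list explicit means a newly added fourth store cannot be silently skipped by the erasure path.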
Memory Architecture Design Principles
- Separate episodic and semantic storage from the start — retrofitting them onto a single store later is more painful than a slightly more complex initial design.
- Design memory consolidation before you design memory storage. Consolidation is the mechanism that keeps the system scalable; without it, episodic memory becomes a liability.
- Implement a memory retrieval test harness before connecting memory to an agent. Verify that the retrieval system surfaces relevant memories and not noisy ones.
- Version your semantic memory schema. Facts change, and you need to be able to migrate existing records when the schema evolves.
- Monitor memory hit rate in production — the percentage of agent sessions where memory retrieval surfaced at least one relevant record. A hit rate below 40% indicates retrieval quality issues, not a storage problem.
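The hit-rate metric from the last principle reduces to a small function over per-session retrieval scores. The 0.7 relevance threshold is an illustrative assumption; the 40% floor comes from the text:

```python
def memory_hit_rate(session_top_scores: list[float],
                    threshold: float = 0.7) -> float:
    """Fraction of sessions where retrieval surfaced at least one relevant
    record. session_top_scores holds the best retrieval score per session
    (0.0 when nothing was retrieved at all)."""
    if not session_top_scores:
        return 0.0
    hits = sum(1 for s in session_top_scores if s >= threshold)
    return hits / len(session_top_scores)
```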
Inductivee's Memory Architecture in Production
The first time we built a three-layer memory system at scale, the hardest problem was not the storage or retrieval — it was defining what constitutes a 'fact' worth promoting from episodic to semantic memory. Episodic memories are cheap to store and retrieve. The consolidation step — deciding which facts are stable enough to promote — requires either careful prompt engineering or domain-specific rules.
For customer-facing enterprise agents, we have settled on a hierarchy of fact stability: stated preferences (high stability, promote immediately), inferred preferences (medium stability, promote after 3+ confirming episodes), and situational context (low stability, keep episodic, do not promote). This taxonomy significantly reduces false promotions, where a one-off comment becomes a permanent fact.
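The stability hierarchy above can be encoded as a small promotion policy. The category names and the 3-episode threshold come from the text; the function shape is our own sketch:

```python
def should_promote(fact_type: str, confirming_episodes: int = 1) -> bool:
    """Decide whether a candidate fact is stable enough to promote from
    episodic to semantic memory."""
    if fact_type == "stated_preference":
        return True  # high stability: promote immediately
    if fact_type == "inferred_preference":
        return confirming_episodes >= 3  # medium stability: needs confirmation
    return False  # situational context stays episodic
```

Running this gate inside the consolidation job is what keeps one-off comments out of the semantic store.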
The LangMem library (released by the LangChain team in 2025) provides the cleanest unified API we have found for managing all three layers. For teams starting a new implementation, begin with LangMem's managed API before self-hosting. The self-hosted path requires running Postgres, Redis, and a vector database, and managing the consolidation pipeline — substantial infrastructure overhead for a team still validating the product.
Frequently Asked Questions
How does persistent memory work in AI agents?
What is the difference between episodic and semantic memory in AI agents?
What is LangMem and what does it do?
How do you handle GDPR compliance for agent memory systems?
How much storage does a persistent agent memory system require?
Written By
Inductivee Team
Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.
Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
Engineer This With Inductivee
The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.
Agentic Custom Software Engineering
We engineer autonomous agentic systems that orchestrate enterprise workflows and unlock the hidden liquidity of your proprietary data.
Autonomous Agentic SaaS
Agentic SaaS development and autonomous platform engineering — we build SaaS products whose core loop is powered by LangGraph and CrewAI agents that execute workflows, not just manage them.
Related Articles
Context Window Management for Long-Running Agents: Engineering Patterns
RAG Pipeline Architecture for the Enterprise: Five Layers Beyond the Basic Chatbot
Agent Design Patterns: ReAct, Reflexion, Plan-and-Execute, and Supervisor-Worker
Ready to Build This Into Your Enterprise?
Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.
Start a Project