Architecture

Context Window Management for Long-Running Agents: Engineering Patterns

Even 200K context windows overflow in long-running enterprise agent workflows. Here are the engineering patterns — sliding windows, summarisation chains, and selective memory — that keep agents coherent across extended tasks.

Inductivee Team · AI Engineering · August 27, 2025 (updated April 15, 2026) · 12 min read
TL;DR

Even with Claude 3.5 Sonnet's 200K token context window and GPT-4o's 128K window, long-running enterprise agents overflow in practice — because the context window contains not just conversation history but system prompts (500-2000 tokens), tool definitions (200-1000 tokens per tool), retrieved document chunks (potentially thousands of tokens), and intermediate reasoning traces from multi-step workflows. Production context management is not just about fitting within the limit — it is about maintaining coherent, relevant context at each step while discarding noise that degrades answer quality. Full context windows produce worse outputs than well-managed smaller contexts.

Why Long-Running Agents Overflow Even 200K Context Windows

The engineering assumption that a 200K token context window 'solves' context management for enterprise agents is incorrect, and it causes teams to defer context management engineering until a production failure forces the issue. Understanding why requires accounting for all the components that consume context in a real agentic workflow.

A production agent system prompt takes 500-2,000 tokens. Tool definitions for 10-15 tools take 2,000-5,000 tokens. A moderately complex conversation with 20 turns takes 5,000-15,000 tokens. Retrieved document chunks for a single RAG query take 2,000-8,000 tokens. In a long-running workflow with 30 steps, each step adding tool call results, the context can exceed 100,000 tokens without any single component being unreasonably large. For 70B-class models running on self-hosted infrastructure, context length also directly affects inference latency: a 100K token context takes 3-4x longer to process than a 25K context, which compounds across a multi-step workflow.

Beyond the overflow problem, there is a quality problem. Research consistently shows that LLM performance on specific facts and instructions degrades when they appear in the middle of a very long context: the 'lost in the middle' effect documented in the LLM literature means that the most important instructions and facts should appear at the beginning and end of the context, not buried in the middle of a 150K token accumulation of prior reasoning. A carefully managed 30K token context with the most relevant history often produces better outputs than a 150K token context that includes everything.

Context Management Strategies: Trade-offs Comparison

| Strategy | Best For | Trade-off | Implementation Complexity |
| --- | --- | --- | --- |
| Sliding window (fixed N messages) | Conversational agents with sequential context | Early context lost entirely; agent may re-ask answered questions | Low |
| Sliding window with overlap | Multi-turn workflows where adjacent steps are most relevant | More context than fixed window but still loses distant history | Low |
| MapReduce summarisation | Long documents or tool outputs that must be retained | Summaries lose detail; expensive (many LLM calls for long content) | Medium |
| Refine summarisation | Progressive summarisation as new content arrives | Sequential; cannot parallelise; latency increases with content length | Medium |
| Selective context injection | Workflows where not all history is relevant to each step | Requires relevance scoring; may miss relevant context if scorer fails | High |
| External memory with semantic search | Long-running agents spanning hours/days with large history | Latency for retrieval; requires vector store infrastructure | High |
| Conversation compression (LLM-based) | Mixed: compress old history, keep recent turns verbatim | LLM calls add latency and cost; quality depends on compression prompt | Medium |

The Four Production-Viable Context Management Patterns

Pattern 1: Sliding Window with Overlap

The simplest production-viable pattern: keep the last N turns verbatim plus a configurable overlap of M turns from earlier in the conversation. The overlap prevents the jarring context discontinuity of a pure fixed window where the agent suddenly 'forgets' a decision made N+1 turns ago. Set N based on your model's context budget (accounting for system prompt and tool definitions), and M to 2-3 turns.

Sliding window is appropriate for conversational agents where the current task is primarily informed by the last few exchanges — customer support, interactive data analysis, multi-turn document review. It is not appropriate for workflows where a decision made early in the session constrains actions later — the agent cannot reference that decision if it has slid out of the window.
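A minimal sketch of the windowing logic (the function name `window_with_overlap` and the plain-list message representation are ours, not from any framework):

```python
def window_with_overlap(messages: list, n_recent: int, m_overlap: int) -> list:
    """Keep the last n_recent messages verbatim, plus m_overlap messages
    from immediately before the window to soften the discontinuity."""
    if len(messages) <= n_recent + m_overlap:
        return list(messages)  # history still fits; nothing to drop
    overlap = messages[-(n_recent + m_overlap):-n_recent]
    recent = messages[-n_recent:]
    return overlap + recent

# 12 turns, keep the last 6 plus a 2-turn overlap: 8 messages survive
history = [f"turn-{i}" for i in range(12)]
print(window_with_overlap(history, n_recent=6, m_overlap=2))
```

The overlap slice is what distinguishes this from a pure fixed window: turns 4-5 survive alongside turns 6-11, so a decision made just outside the main window is still visible.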

Pattern 2: Incremental Compression

When the context approaches a configurable threshold (e.g., 70% of the model's context limit), compress the oldest N messages into a summary using an LLM call, then replace those messages with the summary. The recent K messages are always kept verbatim. This pattern is more coherent than sliding window because it retains a compressed representation of the full history rather than discarding it entirely.

The compression prompt is critical: it should instruct the LLM to preserve decisions made, commitments created, facts established, and open questions — the information that is most likely to be relevant to future steps. Generic summaries that capture narrative flow without preserving these structured facts are not useful for agent context management.

Pattern 3: Selective Context Injection

Rather than managing a single linear context, selective injection maintains a full history in an external store (a vector database or a structured message store) and at each step retrieves only the K most relevant prior messages based on semantic similarity to the current query or task. This is the highest-quality pattern for complex long-running workflows where different earlier exchanges are relevant to different later steps.

The engineering cost is higher: you need a vector store, an embedding pipeline for message content, and a retrieval query at each step. The operational complexity is justified for workflows spanning hours or days where the agent's history is large enough that no summarisation strategy can adequately compress it without losing important context.
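The selection step can be sketched as follows; a production deployment would use an embedding model and a vector store, so the bag-of-words cosine similarity here is only a stand-in that keeps the ranking logic visible:

```python
import math
from collections import Counter

def _bow(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would embed the text instead."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_relevant(history: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k most query-relevant messages, in chronological order,
    so the injected context still reads as a coherent timeline."""
    ranked = sorted(range(len(history)),
                    key=lambda i: _cosine(_bow(history[i]), _bow(query)),
                    reverse=True)
    return [history[i] for i in sorted(ranked[:k])]

history = [
    "Decision: use the EMEA Q2 revenue baseline of 4.2M EUR",
    "User asked about office seating plans",
    "Fact: EMEA revenue variance driven by currency effects",
    "Lunch order confirmed for Friday",
]
print(select_relevant(history, "summarise EMEA revenue decisions", k=2))
```

Note the final chronological re-sort: retrieval ranks by relevance, but the agent should see the selected messages in the order they originally occurred.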

Pattern 4: External Memory with Semantic Search

For agents that must maintain memory across separate sessions — a strategic planning agent that continues a workflow from a previous day, or a customer success agent that must recall conversations from weeks ago — in-context compression is insufficient. External memory stores the full conversation history and any extracted facts/decisions in a persistent store (vector database for semantic retrieval, optionally a structured store for explicit facts) and retrieves relevant context at the start of each new session.

This pattern is implemented in LangChain via the ConversationEntityMemory and VectorStoreRetrieverMemory classes, or as a custom implementation that generates a 'session brief' at session start by retrieving the N most relevant prior exchanges and the extracted entity/fact store. The session brief replaces a full conversation history in the context.
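Because LangChain's memory class names shift between versions, here is the session-brief mechanism sketched in framework-free Python (the file layout and field names are illustrative assumptions, not a standard schema):

```python
import json
import tempfile
from pathlib import Path

def save_session(path: Path, exchanges: list[str], facts: dict[str, str]) -> None:
    """Persist the full history plus extracted facts when a session ends."""
    path.write_text(json.dumps({"exchanges": exchanges, "facts": facts}))

def build_session_brief(path: Path, n_recent: int = 3) -> str:
    """At the start of a new session, rebuild a compact brief instead of
    replaying the entire prior conversation into the context window."""
    state = json.loads(path.read_text())
    facts = "\n".join(f"- {k}: {v}" for k, v in state["facts"].items())
    recent = "\n".join(state["exchanges"][-n_recent:])
    return f"SESSION BRIEF\nKNOWN FACTS:\n{facts}\nRECENT EXCHANGES:\n{recent}"

with tempfile.TemporaryDirectory() as d:
    store = Path(d) / "session.json"
    save_session(store,
                 ["agreed Q3 plan", "flagged vendor risk", "scheduled review"],
                 {"q3_budget": "1.2M EUR", "review_date": "2025-09-12"})
    brief = build_session_brief(store, n_recent=2)
    print(brief)
```

In a real deployment the `exchanges` retrieval would be semantic (most relevant, not most recent) and the fact store would be populated by an extraction step, but the shape of the brief is the same.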

A Production Context Manager with Auto-Compression

import tiktoken
import logging
from typing import Optional
from dataclasses import dataclass, field
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

logger = logging.getLogger(__name__)


@dataclass
class ContextBudget:
    """Token budget allocation for a single LLM call."""
    model_max_tokens: int  # Model's hard context limit
    system_prompt_tokens: int  # Reserved for system prompt
    tool_definitions_tokens: int  # Reserved for tool definitions
    generation_tokens: int  # Reserved for model's output
    history_budget: int = field(init=False)

    def __post_init__(self):
        self.history_budget = (
            self.model_max_tokens
            - self.system_prompt_tokens
            - self.tool_definitions_tokens
            - self.generation_tokens
        )
        if self.history_budget < 1000:
            raise ValueError(
                f"Insufficient history budget ({self.history_budget} tokens). "
                f"Reduce system prompt, tool definitions, or generation reservation."
            )


class ProductionContextManager:
    """
    Manages conversation context for long-running agents with auto-compression.

    Strategy: Keep the last KEEP_VERBATIM messages intact.
    When total history exceeds the budget, compress the oldest messages
    into a structured summary that preserves decisions, facts, and commitments.
    """

    KEEP_VERBATIM = 6  # Always keep the last 6 messages verbatim
    COMPRESSION_THRESHOLD = 0.75  # Compress when at 75% of history budget

    def __init__(
        self,
        budget: ContextBudget,
        model_name: str = "gpt-4o",
        compression_model: str = "gpt-4o-mini"  # cheaper model for compression
    ):
        self.budget = budget
        self.model_name = model_name
        self.messages: list[BaseMessage] = []
        self.compression_count = 0

        try:
            self.tokenizer = tiktoken.encoding_for_model(model_name)
        except KeyError:
            self.tokenizer = tiktoken.get_encoding("cl100k_base")

        self.compression_llm = ChatOpenAI(
            model=compression_model,
            temperature=0,
            max_tokens=800  # Summaries should be concise
        )

    def _count_tokens(self, messages: list[BaseMessage]) -> int:
        """Count tokens for a list of messages."""
        total = 0
        for msg in messages:
            # Account for message overhead (role, format tokens)
            total += 4  # approximate overhead per message
            total += len(self.tokenizer.encode(str(msg.content)))
        return total

    def _compress_messages(self, messages: list[BaseMessage]) -> AIMessage:
        """Compress a batch of messages into a structured summary."""
        if not messages:
            return AIMessage(content="[No prior context]")

        conversation_text = "\n".join(
            # str() guards against non-string content (e.g. tool-call blocks)
            f"{type(m).__name__.replace('Message', '')}: {str(m.content)[:500]}"
            for m in messages
        )

        compression_prompt = ChatPromptTemplate.from_messages([
            ("system",
             """You are compressing conversation history for an AI agent.
             Preserve ALL of the following:
             - Decisions made and their rationale
             - Facts established (numbers, names, dates, statuses)
             - Commitments or actions agreed to
             - Open questions or blockers identified
             - Any constraints or requirements specified

             Format as a structured summary under these headings:
             DECISIONS: [bullet points]
             ESTABLISHED FACTS: [bullet points]
             COMMITMENTS: [bullet points]
             OPEN ITEMS: [bullet points]

             Be specific. Preserve exact values (numbers, names, IDs) not paraphrases.
             Maximum 600 words."""),
            ("human", "Compress this conversation history:\n\n{history}")
        ])

        chain = compression_prompt | self.compression_llm | StrOutputParser()
        summary = chain.invoke({"history": conversation_text})

        self.compression_count += 1
        logger.info(
            f"context_compressed | compression_count={self.compression_count} | "
            f"messages_compressed={len(messages)} | summary_chars={len(summary)}"
        )

        return AIMessage(content=f"[COMPRESSED CONTEXT - Summary of {len(messages)} prior messages]\n{summary}")

    def add_message(self, message: BaseMessage) -> None:
        """Add a message and trigger compression if budget threshold is exceeded."""
        self.messages.append(message)
        self._maybe_compress()

    def _maybe_compress(self) -> None:
        """Check if compression is needed and compress if so."""
        current_tokens = self._count_tokens(self.messages)
        threshold_tokens = int(self.budget.history_budget * self.COMPRESSION_THRESHOLD)

        if current_tokens <= threshold_tokens:
            return

        # Protect the last KEEP_VERBATIM messages from compression
        if len(self.messages) <= self.KEEP_VERBATIM:
            logger.warning(
                f"context_budget_exceeded | tokens={current_tokens} | budget={self.budget.history_budget} | "
                f"cannot_compress: only {len(self.messages)} messages"
            )
            return

        messages_to_compress = self.messages[:-self.KEEP_VERBATIM]
        messages_to_keep = self.messages[-self.KEEP_VERBATIM:]

        summary_message = self._compress_messages(messages_to_compress)
        self.messages = [summary_message] + messages_to_keep

        new_tokens = self._count_tokens(self.messages)
        logger.info(
            f"context_after_compression | before_tokens={current_tokens} | "
            f"after_tokens={new_tokens} | reduction_pct={round((1 - new_tokens/current_tokens)*100, 1)}"
        )

    def get_messages(self) -> list[BaseMessage]:
        """Return the current managed context window."""
        return list(self.messages)

    def get_token_usage(self) -> dict:
        """Return current token usage statistics."""
        current = self._count_tokens(self.messages)
        return {
            "current_tokens": current,
            "budget_tokens": self.budget.history_budget,
            "utilisation_pct": round(current / self.budget.history_budget * 100, 1),
            "compression_count": self.compression_count,
            "message_count": len(self.messages)
        }


# ---- Usage Example ----
if __name__ == "__main__":
    # GPT-4o configuration: 128K total, reserve 2K system + 3K tools + 4K generation
    budget = ContextBudget(
        model_max_tokens=128000,
        system_prompt_tokens=2000,
        tool_definitions_tokens=3000,
        generation_tokens=4000
    )

    ctx = ProductionContextManager(budget=budget)

    # Simulate a long-running agent conversation
    for i in range(25):
        ctx.add_message(HumanMessage(
            content=f"Step {i+1}: Analyse the Q{(i%4)+1} financial data for the {['Americas','EMEA','APAC'][i%3]} region and identify variance drivers."
        ))
        ctx.add_message(AIMessage(
            content=f"Analysis for Step {i+1}: Revenue variance of ${(i+1)*120}K driven by [detailed analysis would appear here across multiple sentences covering all the key points for this step of the analysis workflow]..."
        ))

        usage = ctx.get_token_usage()
        if (i + 1) % 5 == 0:
            print(f"After turn {i+1}: {usage['current_tokens']} tokens ({usage['utilisation_pct']}% of budget), compressions: {usage['compression_count']}")

    print(f"\nFinal message count: {len(ctx.get_messages())}")
    print(f"Final token usage: {ctx.get_token_usage()}")

A production context manager that auto-compresses when approaching 75% of the history token budget. The compression prompt is specifically designed for agent workflows — it preserves decisions, facts, and commitments rather than producing generic narrative summaries.

Tip

The most impactful context management optimisation in production is also the cheapest: audit your system prompt and tool definitions for token waste. System prompts with repetitive caveats, tool descriptions that are paragraph-length essays instead of concise function descriptions, and few-shot examples that are not actually necessary for the task at hand commonly consume 30-40% of a context budget that could be allocated to history. Before implementing any dynamic compression strategy, run a token audit on the static components of your context. We have found 3,000-8,000 recoverable tokens in nearly every system prompt we have audited.
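One way to run such an audit, sketched with the rough 4-characters-per-token heuristic (swap in an exact tokenizer such as tiktoken before acting on the numbers; the function names and sample inputs are ours):

```python
def estimate_tokens(text: str) -> int:
    """~4 characters per token for English prose; a rough planning estimate,
    not a substitute for exact tokenizer counts."""
    return max(1, len(text) // 4)

def audit_static_context(system_prompt: str, tools: dict[str, str]) -> list[tuple[str, int]]:
    """Rank the static context components by estimated token cost,
    biggest consumers first, so waste is visible at a glance."""
    items = [("system_prompt", estimate_tokens(system_prompt))]
    items += [(f"tool:{name}", estimate_tokens(desc)) for name, desc in tools.items()]
    return sorted(items, key=lambda kv: kv[1], reverse=True)

report = audit_static_context(
    "You are a careful financial analysis agent." * 20,  # deliberately bloated
    {"run_query": "Executes a SQL query against the warehouse.",
     "fetch_report": "Fetches a saved report. " * 40},   # essay-length description
)
for name, tokens in report:
    print(f"{name}: ~{tokens} tokens")
```

The padded inputs mimic the common failure modes: a system prompt stuffed with repeated caveats and a tool description forty times longer than it needs to be, both of which jump straight to the top of the ranking.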

Implementing Context Management in a Production Agent

1

Instrument token counting before building compression

Before implementing any compression strategy, add token counting instrumentation to your agent. Log the token count at each step broken down by component: system prompt, tool definitions, conversation history, tool results, current user message. This instrumentation is both the prerequisite for compression (you need to know when to trigger it) and the diagnostic tool for identifying where budget is being consumed inefficiently. Run this instrumentation for a week in staging against real-world inputs before making any architectural decisions.
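A minimal sketch of that per-step instrumentation, using the same rough character-based estimate as a placeholder for exact counting (component names and the helper are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("context_audit")

def log_context_breakdown(step: int, components: dict[str, str],
                          model_limit: int) -> dict[str, int]:
    """Log estimated token usage per component at each agent step so you can
    see where the budget goes before choosing a compression strategy."""
    counts = {name: max(1, len(text) // 4) for name, text in components.items()}
    total = sum(counts.values())
    logger.info("step=%d total_tokens=%d pct_of_limit=%.1f | %s",
                step, total, 100 * total / model_limit,
                " ".join(f"{k}={v}" for k, v in counts.items()))
    counts["total"] = total
    return counts

usage = log_context_breakdown(
    step=7,
    components={"system_prompt": "x" * 4000, "tool_definitions": "y" * 12000,
                "history": "z" * 60000, "current_message": "q" * 400},
    model_limit=128000,
)
```

A week of these log lines against real traffic typically makes the architectural decision for you: if `history` dominates you need compression, if `tool_definitions` dominates you need a token audit first.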

2

Choose the right strategy for your workflow pattern

Match the compression strategy to the workflow shape. Single-session conversational agents with short task horizons (customer support, interactive Q&A): use sliding window with KEEP_VERBATIM=6-8. Multi-step analytical workflows within a single session (report generation, due diligence): use incremental compression with a structured summary prompt. Long-running agents spanning multiple sessions or days (strategic planning, ongoing project management): use external memory with semantic search for cross-session retrieval plus incremental compression for within-session history.

3

Write compression prompts for your specific domain

Generic 'summarise this conversation' prompts produce summaries optimised for narrative coherence, not for agent re-use. Write compression prompts that explicitly preserve the artefacts your specific agent needs: for a procurement agent, preserve vendor names, prices, and compliance decisions; for a legal review agent, preserve clause locations, identified risks, and escalation recommendations; for a data analysis agent, preserve specific metric values, identified anomalies, and agreed hypotheses. Domain-specific compression prompts consistently outperform generic ones by preserving the precise facts the agent needs to continue correctly.
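One way to parameterise this, sketched as a small builder where the artefact lists are illustrative examples rather than a fixed schema:

```python
def build_compression_prompt(artefacts: list[str]) -> str:
    """Assemble a compression prompt that names the exact artefacts this
    agent must preserve; the artefact list is supplied per deployment."""
    preserve = "\n".join(f"- {a}" for a in artefacts)
    return (
        "You are compressing conversation history for an AI agent.\n"
        "Preserve ALL of the following, with exact values (numbers, names, IDs):\n"
        f"{preserve}\n"
        "Format as bullet points grouped by artefact type. Maximum 400 words."
    )

# Hypothetical procurement-agent configuration
procurement_prompt = build_compression_prompt([
    "Vendor names and quoted prices",
    "Compliance decisions and who approved them",
    "Open negotiation items and deadlines",
])
print(procurement_prompt)
```

Keeping the artefact list in configuration rather than hard-coded in the prompt makes it reviewable per deployment, which matters given how sensitive downstream quality is to what the summary preserves.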

4

Test compression under adversarial conditions

Test your context manager with sessions designed to trigger compression multiple times: 50-turn conversations, tool results with very large payloads, rapid-fire short exchanges that accumulate message count faster than token count. Verify that the agent's behaviour remains coherent after compression — ask it to recall a specific decision made before compression and verify it can retrieve it from the summary. The failure mode to watch for is 'soft forgetting' — the agent produces plausible-sounding answers about prior decisions that are actually hallucinations because the compression summary did not preserve the specific detail.
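A sketch of such a test harness, using a stubbed manager whose summary is a placeholder string rather than an LLM call, so the trigger and retention logic can be asserted deterministically:

```python
class StubContextManager:
    """Minimal stand-in for a production context manager: same trigger logic,
    but compression just emits a marker string instead of calling an LLM."""
    KEEP_VERBATIM = 6
    THRESHOLD = 0.75

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.messages: list[str] = []
        self.compressions = 0

    def _tokens(self) -> int:
        return sum(len(m) // 4 for m in self.messages)  # rough estimate

    def add(self, msg: str) -> None:
        self.messages.append(msg)
        if (self._tokens() > int(self.budget * self.THRESHOLD)
                and len(self.messages) > self.KEEP_VERBATIM):
            old = self.messages[:-self.KEEP_VERBATIM]
            recent = self.messages[-self.KEEP_VERBATIM:]
            self.messages = [f"[SUMMARY of {len(old)} messages]"] + recent
            self.compressions += 1

# Adversarial run: 50 turns against a deliberately tiny budget
mgr = StubContextManager(budget_tokens=500)
for i in range(50):
    mgr.add(f"turn-{i}: " + "x" * 80)
print(f"compressions={mgr.compressions}, final_messages={len(mgr.messages)}")
```

This catches the mechanical failures (compression never firing, recent turns being discarded). The 'soft forgetting' failure mode still needs an end-to-end check with the real compression LLM, asking the agent to recall a pre-compression decision.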

How Inductivee Engineers Context Management at Scale

Context management is part of the Orchestrate phase at Inductivee — the engineering layer that makes agentic systems reliable in production rather than impressive in a demo. Our standard production configuration combines a token-counting middleware layer (instrumented at every LLM call), an incremental compression strategy with domain-specific summary prompts, and a configurable threshold that triggers compression at 70-75% of the history budget to maintain headroom for unexpected large tool results.

The single most valuable lesson from our deployments: the compression prompt is the most sensitive and high-leverage component of the entire context management system. A well-engineered compression prompt that preserves the right structured facts can reduce context size by 80% while maintaining agent coherence across the full session. A generic compression prompt that produces narrative summaries can reduce context size by 80% while producing an agent that subtly hallucinates prior decisions for the remainder of the session. The difference is invisible in the token count and visible only in the quality of downstream outputs — which is why we now treat compression prompt design with the same rigour as system prompt design.

Frequently Asked Questions

What is context window management and why does it matter for AI agents?

Context window management is the engineering discipline of keeping the content passed to an LLM at each step within the model's token limit while preserving the most relevant information for the current task. It matters for production agents because context windows overflow even at 200K tokens in long-running workflows — system prompts, tool definitions, conversation history, and retrieved document chunks accumulate rapidly. Beyond overflow, a full context window produces worse outputs than a well-managed smaller one due to the 'lost in the middle' effect, where important information buried in a very long context is effectively ignored by the model.

What is the maximum context window size for leading LLMs in 2025?

As of August 2025, Claude 3.5 Sonnet has a 200K token context window and GPT-4o has a 128K token context window. However, the effective usable context for production agent workflows is significantly smaller: accounting for system prompts (500-2,000 tokens), tool definitions (2,000-5,000 tokens), and generation output reservation (2,000-4,000 tokens), the history budget for a GPT-4o deployment is approximately 117-123K tokens, before message formatting overhead. In a multi-step workflow with large tool results, this is consumed faster than the raw number suggests.

What is the best strategy for managing long conversations with AI agents?

The best strategy depends on the workflow pattern. For single-session conversational agents, incremental compression — compressing the oldest messages into a structured summary when approaching the token budget threshold, while keeping the most recent messages verbatim — provides the best balance of context retention and budget management. For agents that must maintain memory across multiple sessions, external memory with semantic search (storing full history in a vector database and retrieving the most relevant prior exchanges at each step) is necessary. The compression prompt's content is as important as the strategy itself — preserve specific decisions, facts, and commitments, not narrative summaries.

How do you count tokens for LLM context management in Python?

Use the tiktoken library (pip install tiktoken) for OpenAI models — tiktoken.encoding_for_model('gpt-4o').encode(text) returns the token list for a string. For Claude models, the anthropic Python SDK provides client.count_tokens() for exact token counting. As a rule of thumb, English prose averages 4 characters per token and code averages 3 characters per token, which is useful for rough budget estimation before implementing exact counting. Always instrument your production agent with exact token counting — rough estimates consistently undercount by 15-25% once message formatting overhead is included.

Does a larger context window mean the model uses all of it effectively?

No — research consistently shows LLM performance degrades on specific facts and instructions that appear in the middle of very long contexts, a phenomenon called 'lost in the middle.' Models most reliably attend to information at the beginning and end of the context. For production agents, this means: critical instructions should appear in the system prompt (start of context), the most relevant recent history should appear closest to the current message (end of context), and a full 200K context with everything stuffed in is often worse than a well-managed 40K context with the most relevant information carefully selected. Context quality beats context quantity.

Written By

Inductivee Team — Agentic AI Engineering at Inductivee

The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.

Agentic AI Architecture · Multi-Agent Orchestration · LangChain · LangGraph · CrewAI · Microsoft AutoGen

Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
