Context Window Management for Long-Running Agents: Engineering Patterns
Even 200K context windows overflow in long-running enterprise agent workflows. Here are the engineering patterns — sliding windows, summarisation chains, and selective memory — that keep agents coherent across extended tasks.
Even with Claude 3.5 Sonnet's 200K token context window and GPT-4o's 128K window, long-running enterprise agents overflow in practice — because the context window contains not just conversation history but system prompts (500-2000 tokens), tool definitions (200-1000 tokens per tool), retrieved document chunks (potentially thousands of tokens), and intermediate reasoning traces from multi-step workflows. Production context management is not just about fitting within the limit — it is about maintaining coherent, relevant context at each step while discarding noise that degrades answer quality. Full context windows produce worse outputs than well-managed smaller contexts.
Why Long-Running Agents Overflow Even 200K Context Windows
The engineering assumption that a 200K token context window 'solves' context management for enterprise agents is incorrect, and it causes teams to defer context management engineering until a production failure forces the issue. Understanding why requires accounting for all the components that consume context in a real agentic workflow.
A production agent system prompt takes 500-2,000 tokens. Tool definitions for 10-15 tools take 2,000-5,000 tokens. A moderately complex conversation with 20 turns takes 5,000-15,000 tokens. Retrieved document chunks for a single RAG query take 2,000-8,000 tokens. In a long-running workflow with 30 steps, each step adding tool call results, the context can exceed 100,000 tokens without any single component being unreasonably large. For 70B-class models running on self-hosted infrastructure, context length also directly affects inference latency — a 100K token context takes 3-4x longer to process than a 25K context, a cost that compounds across a multi-step workflow.
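Those components add up quickly. A back-of-envelope sketch, using midpoint figures from the ranges above (all numbers are illustrative assumptions, not measurements):

```python
# Illustrative token budget for a 30-step workflow, using midpoint
# estimates from the ranges quoted above (all figures are assumptions).
system_prompt = 1_250          # 500-2,000 token range
tool_definitions = 3_500       # 10-15 tools, 2,000-5,000 total
conversation = 10_000          # ~20 turns at 5,000-15,000
rag_chunks_per_step = 5_000    # 2,000-8,000 per retrieval
steps_with_retrieval = 18      # subset of the 30 steps that call RAG

total = (system_prompt + tool_definitions + conversation
         + rag_chunks_per_step * steps_with_retrieval)
print(total)  # 104750 -- past 100K without any single oversized component
```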
Beyond the overflow problem, there is a quality problem. Research consistently shows that LLM performance on specific facts and instructions degrades when they appear early in a very long context — the 'lost in the middle' effect documented in the LLM literature means that the most important instructions and facts should appear at the beginning and end of the context, not buried in the middle of a 150K token accumulation of prior reasoning. A carefully managed 30K token context with the most relevant history often produces better outputs than a 150K token context that includes everything.
Context Management Strategies: Trade-offs Comparison
| Strategy | Best For | Trade-off | Implementation Complexity |
|---|---|---|---|
| Sliding window (fixed N messages) | Conversational agents with sequential context | Early context lost entirely; agent may re-ask answered questions | Low |
| Sliding window with overlap | Multi-turn workflows where adjacent steps are most relevant | More context than fixed window but still loses distant history | Low |
| MapReduce summarisation | Long documents or tool outputs that must be retained | Summaries lose detail; expensive (many LLM calls for long content) | Medium |
| Refine summarisation | Progressive summarisation as new content arrives | Sequential — cannot parallelise; latency increases with content length | Medium |
| Selective context injection | Workflows where not all history is relevant to each step | Requires relevance scoring; may miss relevant context if scorer fails | High |
| External memory with semantic search | Long-running agents spanning hours/days with large history | Latency for retrieval; requires vector store infrastructure | High |
| Conversation compression (LLM-based) | Mixed: compress old history, keep recent turns verbatim | LLM calls add latency and cost; quality depends on compression prompt | Medium |
The Four Production-Viable Context Management Patterns
Pattern 1: Sliding Window with Overlap
The simplest production-viable pattern: keep the last N turns verbatim plus a configurable overlap of M turns from earlier in the conversation. The overlap prevents the jarring context discontinuity of a pure fixed window where the agent suddenly 'forgets' a decision made N+1 turns ago. Set N based on your model's context budget (accounting for system prompt and tool definitions), and M to 2-3 turns.
Sliding window is appropriate for conversational agents where the current task is primarily informed by the last few exchanges — customer support, interactive data analysis, multi-turn document review. It is not appropriate for workflows where a decision made early in the session constrains actions later — the agent cannot reference that decision if it has slid out of the window.
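The pattern itself is a few lines. A minimal sketch, with `n` and `m` as the N/M knobs described above (the function name and defaults are illustrative):

```python
def sliding_window(messages: list, n: int = 6, m: int = 2) -> list:
    """Keep the last `n` messages verbatim plus `m` overlap messages
    from immediately before the window (Pattern 1 as described above)."""
    if len(messages) <= n + m:
        return list(messages)
    return messages[-(n + m):]

history = [f"msg{i}" for i in range(12)]
print(sliding_window(history, n=6, m=2))  # msg4 .. msg11
```

The overlap messages are just the `m` turns preceding the verbatim window, so the whole operation is a single slice; the trade-off is that anything older than `n + m` turns is gone entirely.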
Pattern 2: Incremental Compression
When the context approaches a configurable threshold (e.g., 70% of the model's context limit), compress the oldest N messages into a summary using an LLM call, then replace those messages with the summary. The recent K messages are always kept verbatim. This pattern is more coherent than sliding window because it retains a compressed representation of the full history rather than discarding it entirely.
The compression prompt is critical: it should instruct the LLM to preserve decisions made, commitments created, facts established, and open questions — the information that is most likely to be relevant to future steps. Generic summaries that capture narrative flow without preserving these structured facts are not useful for agent context management.
Pattern 3: Selective Context Injection
Rather than managing a single linear context, selective injection maintains a full history in an external store (a vector database or a structured message store) and at each step retrieves only the K most relevant prior messages based on semantic similarity to the current query or task. This is the highest-quality pattern for complex long-running workflows where different earlier exchanges are relevant to different later steps.
The engineering cost is higher: you need a vector store, an embedding pipeline for message content, and a retrieval query at each step. The operational complexity is justified for workflows spanning hours or days where the agent's history is large enough that no summarisation strategy can adequately compress it without losing important context.
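The retrieval step can be sketched without any vector store at all, using a bag-of-words stand-in for the embedding model (everything here, the `_embed` stub especially, is an illustrative assumption; a production system uses a real embedding model and an ANN index):

```python
import math
from collections import Counter

def _embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words term counts. This stub only
    exists to make the retrieval logic concrete and testable."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_relevant(history: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k prior messages most similar to the current query."""
    q = _embed(query)
    scored = sorted(history, key=lambda m: _cosine(_embed(m), q), reverse=True)
    return scored[:k]
```

The shape of the real system is identical: embed the current task, score every stored message, inject only the top-k into the prompt.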
Pattern 4: External Memory with Semantic Search
For agents that must maintain memory across separate sessions — a strategic planning agent that continues a workflow from a previous day, or a customer success agent that must recall conversations from weeks ago — in-context compression is insufficient. External memory stores the full conversation history and any extracted facts/decisions in a persistent store (vector database for semantic retrieval, optionally a structured store for explicit facts) and retrieves relevant context at the start of each new session.
This pattern is implemented in LangChain via the ConversationEntityMemory and VectorStoreRetrieverMemory classes, or as a custom implementation that generates a 'session brief' at session start by retrieving the N most relevant prior exchanges and the extracted entity/fact store. The session brief replaces a full conversation history in the context.
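In the custom-implementation case, the session brief can be as simple as concatenating the extracted fact store with the retrieved exchanges. A sketch (the layout, heading names, and function signature below are illustrative, not a LangChain API):

```python
def build_session_brief(facts: dict[str, str],
                        relevant_exchanges: list[str],
                        max_exchanges: int = 5) -> str:
    """Assemble a 'session brief' that stands in for full conversation
    history at the start of a new session."""
    lines = ["SESSION BRIEF", "", "KNOWN FACTS:"]
    lines += [f"- {k}: {v}" for k, v in facts.items()]
    lines += ["", "RELEVANT PRIOR EXCHANGES:"]
    lines += [f"- {x}" for x in relevant_exchanges[:max_exchanges]]
    return "\n".join(lines)
```

The facts dict would be populated by the entity/fact extraction pipeline, and the exchange list by semantic retrieval against the persistent store.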
A Production Context Manager with Auto-Compression
import tiktoken
import logging
from typing import Optional
from dataclasses import dataclass, field
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
logger = logging.getLogger(__name__)
@dataclass
class ContextBudget:
"""Token budget allocation for a single LLM call."""
model_max_tokens: int # Model's hard context limit
system_prompt_tokens: int # Reserved for system prompt
tool_definitions_tokens: int # Reserved for tool definitions
generation_tokens: int # Reserved for model's output
history_budget: int = field(init=False)
def __post_init__(self):
self.history_budget = (
self.model_max_tokens
- self.system_prompt_tokens
- self.tool_definitions_tokens
- self.generation_tokens
)
if self.history_budget < 1000:
raise ValueError(
f"Insufficient history budget ({self.history_budget} tokens). "
f"Reduce system prompt, tool definitions, or generation reservation."
)
class ProductionContextManager:
"""
Manages conversation context for long-running agents with auto-compression.
Strategy: Keep the last KEEP_VERBATIM messages intact.
When total history exceeds the budget, compress the oldest messages
into a structured summary that preserves decisions, facts, and commitments.
"""
KEEP_VERBATIM = 6 # Always keep the last 6 messages verbatim
COMPRESSION_THRESHOLD = 0.75 # Compress when at 75% of history budget
def __init__(
self,
budget: ContextBudget,
model_name: str = "gpt-4o",
compression_model: str = "gpt-4o-mini" # cheaper model for compression
):
self.budget = budget
self.model_name = model_name
self.messages: list[BaseMessage] = []
self.compression_count = 0
try:
self.tokenizer = tiktoken.encoding_for_model(model_name)
except KeyError:
self.tokenizer = tiktoken.get_encoding("cl100k_base")
self.compression_llm = ChatOpenAI(
model=compression_model,
temperature=0,
max_tokens=800 # Summaries should be concise
)
def _count_tokens(self, messages: list[BaseMessage]) -> int:
"""Count tokens for a list of messages."""
total = 0
for msg in messages:
# Account for message overhead (role, format tokens)
total += 4 # approximate overhead per message
total += len(self.tokenizer.encode(str(msg.content)))
return total
def _compress_messages(self, messages: list[BaseMessage]) -> AIMessage:
"""Compress a batch of messages into a structured summary."""
if not messages:
return AIMessage(content="[No prior context]")
conversation_text = "\n".join(
f"{type(m).__name__.replace('Message', '')}: {m.content[:500]}"
for m in messages
)
compression_prompt = ChatPromptTemplate.from_messages([
("system",
"""You are compressing conversation history for an AI agent.
Preserve ALL of the following:
- Decisions made and their rationale
- Facts established (numbers, names, dates, statuses)
- Commitments or actions agreed to
- Open questions or blockers identified
- Any constraints or requirements specified
Format as a structured summary under these headings:
DECISIONS: [bullet points]
ESTABLISHED FACTS: [bullet points]
COMMITMENTS: [bullet points]
OPEN ITEMS: [bullet points]
Be specific. Preserve exact values (numbers, names, IDs) not paraphrases.
Maximum 600 words."""),
("human", "Compress this conversation history:\n\n{history}")
])
chain = compression_prompt | self.compression_llm | StrOutputParser()
summary = chain.invoke({"history": conversation_text})
self.compression_count += 1
logger.info(
f"context_compressed | compression_count={self.compression_count} | "
f"messages_compressed={len(messages)} | summary_chars={len(summary)}"
)
return AIMessage(content=f"[COMPRESSED CONTEXT - Summary of {len(messages)} prior messages]\n{summary}")
def add_message(self, message: BaseMessage) -> None:
"""Add a message and trigger compression if budget threshold is exceeded."""
self.messages.append(message)
self._maybe_compress()
def _maybe_compress(self) -> None:
"""Check if compression is needed and compress if so."""
current_tokens = self._count_tokens(self.messages)
threshold_tokens = int(self.budget.history_budget * self.COMPRESSION_THRESHOLD)
if current_tokens <= threshold_tokens:
return
# Protect the last KEEP_VERBATIM messages from compression
if len(self.messages) <= self.KEEP_VERBATIM:
logger.warning(
f"context_budget_exceeded | tokens={current_tokens} | budget={self.budget.history_budget} | "
f"cannot_compress: only {len(self.messages)} messages"
)
return
messages_to_compress = self.messages[:-self.KEEP_VERBATIM]
messages_to_keep = self.messages[-self.KEEP_VERBATIM:]
summary_message = self._compress_messages(messages_to_compress)
self.messages = [summary_message] + messages_to_keep
new_tokens = self._count_tokens(self.messages)
logger.info(
f"context_after_compression | before_tokens={current_tokens} | "
f"after_tokens={new_tokens} | reduction_pct={round((1 - new_tokens/current_tokens)*100, 1)}"
)
def get_messages(self) -> list[BaseMessage]:
"""Return the current managed context window."""
return list(self.messages)
def get_token_usage(self) -> dict:
"""Return current token usage statistics."""
current = self._count_tokens(self.messages)
return {
"current_tokens": current,
"budget_tokens": self.budget.history_budget,
"utilisation_pct": round(current / self.budget.history_budget * 100, 1),
"compression_count": self.compression_count,
"message_count": len(self.messages)
}
# ---- Usage Example ----
if __name__ == "__main__":
# GPT-4o configuration: 128K total, reserve 2K system + 3K tools + 4K generation
budget = ContextBudget(
model_max_tokens=128000,
system_prompt_tokens=2000,
tool_definitions_tokens=3000,
generation_tokens=4000
)
ctx = ProductionContextManager(budget=budget)
# Simulate a long-running agent conversation
for i in range(25):
ctx.add_message(HumanMessage(
content=f"Step {i+1}: Analyse the Q{(i%4)+1} financial data for the {['Americas','EMEA','APAC'][i%3]} region and identify variance drivers."
))
ctx.add_message(AIMessage(
content=f"Analysis for Step {i+1}: Revenue variance of ${(i+1)*120}K driven by [detailed analysis would appear here across multiple sentences covering all the key points for this step of the analysis workflow]..."
))
usage = ctx.get_token_usage()
if (i + 1) % 5 == 0:
print(f"After turn {i+1}: {usage['current_tokens']} tokens ({usage['utilisation_pct']}% of budget), compressions: {usage['compression_count']}")
print(f"\nFinal message count: {len(ctx.get_messages())}")
    print(f"Final token usage: {ctx.get_token_usage()}")

A production context manager that auto-compresses when approaching 75% of the history token budget. The compression prompt is specifically designed for agent workflows — it preserves decisions, facts, and commitments rather than producing generic narrative summaries.
The most impactful context management optimisation in production is also the cheapest: audit your system prompt and tool definitions for token waste. System prompts with repetitive caveats, tool descriptions that are paragraph-length essays instead of concise function descriptions, and few-shot examples that are not actually necessary for the task at hand commonly consume 30-40% of a context budget that could be allocated to history. Before implementing any dynamic compression strategy, run a token audit on the static components of your context. We have found 3,000-8,000 recoverable tokens in nearly every system prompt we have audited.
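A token audit needs nothing more than a counter and the static strings. A rough sketch, using a ~4 characters per token heuristic in place of a real tokenizer (swap in tiktoken for exact counts; the function names are illustrative):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate (~4 chars per token for English text).
    Good enough to rank static components by cost."""
    return max(1, len(text) // 4)

def audit_static_context(system_prompt: str,
                         tool_descriptions: dict[str, str]) -> dict:
    """Report the static token cost per component, largest first."""
    report = {"system_prompt": approx_tokens(system_prompt)}
    report.update({f"tool:{name}": approx_tokens(desc)
                   for name, desc in tool_descriptions.items()})
    return dict(sorted(report.items(), key=lambda kv: kv[1], reverse=True))
```

Running this against your actual prompt and tool definitions usually makes the paragraph-length tool descriptions obvious at a glance.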
Implementing Context Management in a Production Agent
Instrument token counting before building compression
Before implementing any compression strategy, add token counting instrumentation to your agent. Log the token count at each step broken down by component: system prompt, tool definitions, conversation history, tool results, current user message. This instrumentation is both the prerequisite for compression (you need to know when to trigger it) and the diagnostic tool for identifying where budget is being consumed inefficiently. Run this instrumentation for a week in staging against real-world inputs before making any architectural decisions.
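A minimal version of that instrumentation (component names and log format below are illustrative; the token counts come from whatever counter you already use):

```python
import logging

logger = logging.getLogger("context_audit")

def log_context_breakdown(step: int, components: dict[str, int]) -> dict:
    """Log per-component token counts for one agent step. `components`
    maps a component name (system_prompt, tools, history, tool_results,
    user_message) to its token count."""
    total = sum(components.values())
    logger.info("context_breakdown | step=%d | total=%d | %s",
                step, total,
                " | ".join(f"{k}={v}" for k, v in components.items()))
    return {"step": step, "total": total, **components}
```

Emitting one structured log line per LLM call is enough to build the week-long staging picture described above.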
Choose the right strategy for your workflow pattern
Match the compression strategy to the workflow shape. Single-session conversational agents with short task horizons (customer support, interactive Q&A): use sliding window with KEEP_VERBATIM=6-8. Multi-step analytical workflows within a single session (report generation, due diligence): use incremental compression with a structured summary prompt. Long-running agents spanning multiple sessions or days (strategic planning, ongoing project management): use external memory with semantic search for cross-session retrieval plus incremental compression for within-session history.
Write compression prompts for your specific domain
Generic 'summarise this conversation' prompts produce summaries optimised for narrative coherence, not for agent re-use. Write compression prompts that explicitly preserve the artefacts your specific agent needs: for a procurement agent, preserve vendor names, prices, and compliance decisions; for a legal review agent, preserve clause locations, identified risks, and escalation recommendations; for a data analysis agent, preserve specific metric values, identified anomalies, and agreed hypotheses. Domain-specific compression prompts consistently outperform generic ones by preserving the precise facts the agent needs to continue correctly.
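One way to keep domain prompts consistent is to template the preserve list and vary only the artefacts per agent. A sketch (the procurement artefacts mirror the example above; the function and exact wording are illustrative):

```python
def build_compression_prompt(domain_artefacts: list[str]) -> str:
    """Build a domain-specific compression system prompt; only the
    artefact list changes between agents."""
    preserve = "\n".join(f"- {a}" for a in domain_artefacts)
    return (
        "You are compressing conversation history for an AI agent.\n"
        "Preserve ALL of the following, with exact values:\n"
        f"{preserve}\n"
        "Format as bullet points under DECISIONS / FACTS / OPEN ITEMS.\n"
        "Maximum 600 words."
    )

procurement_prompt = build_compression_prompt([
    "Vendor names and quoted prices",
    "Compliance decisions and rationale",
    "Contract terms agreed or rejected",
])
```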
Test compression under adversarial conditions
Test your context manager with sessions designed to trigger compression multiple times: 50-turn conversations, tool results with very large payloads, rapid-fire short exchanges that accumulate message count faster than token count. Verify that the agent's behaviour remains coherent after compression — ask it to recall a specific decision made before compression and verify it can retrieve it from the summary. The failure mode to watch for is 'soft forgetting' — the agent produces plausible-sounding answers about prior decisions that are actually hallucinations because the compression summary did not preserve the specific detail.
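The recall check can be made deterministic by stubbing the compressor, which keeps the test about the pipeline rather than the LLM (the stub and the scenario below are illustrative):

```python
def fake_compress(messages: list[str]) -> str:
    """Stub compressor that keeps lines containing 'DECISION' -- a
    stand-in for the real LLM call so the recall test is deterministic."""
    kept = [m for m in messages if "DECISION" in m]
    return "[COMPRESSED] " + " | ".join(kept)

def test_decision_survives_compression():
    # 48 filler turns plus one early decision, as in a 50-turn session
    history = [f"turn {i}: routine status update" for i in range(48)]
    history.insert(3, "DECISION: use vendor Acme at $50K, approved by CFO")
    summary = fake_compress(history[:-6])  # compress all but the last 6 turns
    # The 'soft forgetting' check: the specific detail must be recoverable
    assert "vendor Acme" in summary and "$50K" in summary

test_decision_survives_compression()
```

With the real compressor substituted in, the same assertion becomes the regression test that catches compression prompts which drop exact values.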
How Inductivee Engineers Context Management at Scale
Context management is part of the Orchestrate phase at Inductivee — the engineering layer that makes agentic systems reliable in production rather than impressive in a demo. Our standard production configuration combines a token-counting middleware layer (instrumented at every LLM call), an incremental compression strategy with domain-specific summary prompts, and a configurable threshold that triggers compression at 70-75% of the history budget to maintain headroom for unexpected large tool results.
The single most valuable lesson from our deployments: the compression prompt is the most sensitive and high-leverage component of the entire context management system. A well-engineered compression prompt that preserves the right structured facts can reduce context size by 80% while maintaining agent coherence across the full session. A generic compression prompt that produces narrative summaries can reduce context size by 80% while producing an agent that subtly hallucinates prior decisions for the remainder of the session. The difference is invisible in the token count and visible only in the quality of downstream outputs — which is why we now treat compression prompt design with the same rigour as system prompt design.
Written By
Inductivee Team
Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.
Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
Engineer This With Inductivee
The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.
Agentic Custom Software Engineering
We engineer autonomous agentic systems that orchestrate enterprise workflows and unlock the hidden liquidity of your proprietary data.
Autonomous Agentic SaaS
Agentic SaaS development and autonomous platform engineering — we build SaaS products whose core loop is powered by LangGraph and CrewAI agents that execute workflows, not just manage them.
Related Articles
AI Agent Memory Architecture: Long-Term, Persistent Cognition for Production Agents
What Is Agentic AI? A Practical Guide for Enterprise Engineering Teams
RAG Pipeline Architecture for the Enterprise: Five Layers Beyond the Basic Chatbot
Ready to Build This Into Your Enterprise?
Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.
Start a Project