
Guillaume Lebedel · 15 min
Agentic Context Engineering: Why AI Agents Kill Their Own Context Windows

Your agent just killed itself.

It didn’t crash. It didn’t throw an exception. It made a perfectly reasonable decision that happened to generate 250,000 tokens of output, exceeding its 200K context window. The request failed, the task died, and the agent never understood why.

This is agent suicide by context: agents taking actions that destroy their own ability to continue.

What Is Agentic Context Engineering?

Agentic context engineering is the practice of controlling what information an AI agent sees at each step of execution — system prompts, tool definitions, conversation history, and tool responses — so the agent completes multi-step tasks within its token budget. Where prompt engineering crafts the initial instruction, context engineering manages the entire information lifecycle: what enters context, what gets compressed, what gets isolated to sub-agents, and what never enters at all.

[Diagram: AI agent context windows fill up through tool calls, leading to context overflow]

Why AI Agents Are Bad at Managing Context Limits

Context limits aren’t new, but the shape of the problem has changed. Early discussions focused on RAG: how much to pre-load before the conversation starts. Agentic workflows changed everything. Context now builds dynamically through tool calls, accumulating unpredictably as agents take actions.

Modern agents can see their token budgets. Claude models with extended thinking can track usage during agentic loops. But knowing your budget doesn’t mean you’ll manage it well. Anthropic’s context engineering guide (September 2025) documents how agents processing large codebases routinely exceed available context through sequential tool calls. Factory.ai’s research on context windows (August 2025) found that even million-token windows are insufficient for typical enterprise codebases, and that indiscriminate context inclusion degrades reasoning quality.

The fundamental issue is architectural. LLM interaction is single-threaded: request in, response out, tool call, response back. There’s no native way to fork, parallelize, or isolate risky operations. Without orchestration, everything happens in one context with no recovery path. The harness layer is where this gets fixed. Sub-agents add isolation, turning every tool call from a gamble into a recoverable operation.

And agents gamble constantly. When an agent calls a list tool or reads a file, it could check the size first, but most don’t. They do what they’re told (finding, searching, trawling for more context) even when it hurts them. The tool call looks reasonable. The 150K response is fatal.

We see this regularly at StackOne when building connectors and AI actions for Notion and GitHub. An agent retrieves a page, the page contains embedded databases, the response balloons to 200K tokens. Dead before it processes anything useful. We can help by offering options to retrieve less data. But we’re tool providers, not the agent harness. The architecture has to be designed for survival.

How AI Agents Exceed Their Context Windows

Fatal Context Overflows: How Agents Die on the First Tool Call

Some agents start so heavy that the first tool call finishes them off.

RAG Pre-Loading. An agent configured to retrieve “relevant context” at startup pulls 50 documents averaging 2,500 tokens each. That’s 125K tokens before the user says anything. The agent isn’t dead yet, but it’s wounded. The user asks a question, the agent makes one tool call to fetch more context, and that 80K response pushes it over.

// Agent starts with 125K of pre-loaded context
{
  "relevant_context": [
    { "doc_id": "auth-001", "content": "## Authentication Flow...", "tokens": 2847 },
    { "doc_id": "auth-002", "content": "## Token Refresh...", "tokens": 1923 },
    // ... 48 more documents
  ],
  "total_tokens": 125000  // Already 62% of 200K limit before user speaks
}
// User: "How does the payment flow work?" → +85K tokens → Dead

Tool Definition Catch-22. MCP servers expose tools with JSON schemas. Detailed descriptions and input/output schemas improve accuracy, but the more detail you add, the more context you burn. A server with 500+ tools can use 100K+ tokens on definitions alone.

The irony: you need good tool definitions for accuracy, but verbose definitions eat the context you need for actual work. Sparse definitions save tokens but increase hallucination. There’s no winning without dynamic tool loading. We built semantic search across 10,000+ actions to solve exactly this problem.

// Just 3 of 500+ tool definitions in the system prompt
{
  "tools": [
    {
      "name": "salesforce_create_lead",
      "description": "Creates a new lead in Salesforce CRM...",
      "parameters": {
        "type": "object",
        "properties": {
          "first_name": { "type": "string", "description": "Lead's first name" },
          "last_name": { "type": "string", "description": "Lead's last name" },
          "company": { "type": "string", "description": "Company name" },
          "email": { "type": "string", "format": "email" },
          // ... 20 more fields per tool
        },
        "required": ["last_name", "company"]
      }
    },
    // ... 499 more tools with similar verbosity
  ]
}

For a framework on when to use MCP schemas vs CLI tools that skip schemas entirely, see MCP vs CLI for AI agents: when each one wins.

Greedy File Reads. “Read the codebase to understand the structure” sounds reasonable. The agent reads three files, each 15K tokens. Then it decides it needs more context. Ten files later, it’s dead.

Gradual Context Degradation: Death by a Thousand Cuts

Other deaths are slower. Drew Breunig’s analysis of context failure modes (June 2025) identified what he calls context rot — the progressive degradation of model performance as context accumulates. Underlying research from Databricks Mosaic found that model correctness starts dropping after 32K tokens, with agents favoring repetitive actions from their growing history. The “lost in the middle” effect (Liu et al., 2023) compounds this: LLMs strongly favor information at the beginning and end of the context window while ignoring content placed in the middle, meaning an agent with 150K tokens of context may effectively disregard the middle 100K tokens of tool responses.

Conversation Bloat. Every turn adds tokens. Without compaction, a 20-turn conversation easily exceeds 200K tokens. The agent “remembers” everything but understands less and less.

Context Poisoning. When playing Pokemon Red, researchers found that Gemini agents would occasionally hallucinate, poisoning their own context with false information. Once the scratchpad was corrupted with hallucinated goals (like seeking items from a different Pokemon game), the agent became fixated on impossible objectives and adopted counterproductive strategies rather than navigating correctly.

Intermediate Result Accumulation. Each tool call returns results. Each result stays in context. An agent syncing data across Workday, BambooHR, and similar HRIS systems accumulates every response. Thirty API calls at 3K tokens each is 90K tokens of intermediate state.

// Context after 30 sequential API calls
{
  "conversation": [
    { "role": "user", "content": "Sync all employee data from our HRIS systems" },
    { "role": "assistant", "content": "I'll fetch employees from each system." },
    { "role": "tool", "name": "workday_list_employees", "tokens": 3200 },
    { "role": "tool", "name": "bamboohr_list_employees", "tokens": 2800 },
    { "role": "tool", "name": "adp_list_employees", "tokens": 3100 },
    // ... 27 more tool responses
  ],
  "accumulated_tool_tokens": 94500,
  "total_context": 98200  // Half the context is intermediate results
}

Why AI Agents Don’t Protect Themselves

Agents can be taught to monitor their own context. Some frameworks expose token counts. You can prompt agents to check response sizes before committing. The problem: context management is a job in itself.

An agent juggling “solve the user’s problem” and “don’t kill yourself” is doing two jobs at once. Tracking resource constraints competes with solving the actual problem. Every token spent reasoning about context is a token not spent on the problem.

This is partly why sub-agent architectures work: you can dedicate the orchestrator to planning and context management while sub-agents focus purely on execution. The orchestrator doesn’t need to predict response sizes if it delegates risky operations to disposable workers.

Context Engineering Architectures That Prevent Overflow

You can’t predict your way out of this. You need architectures that survive when prediction fails.

Sub-Agent Isolation: Containing Context Overflow

The strongest pattern: delegate risky operations to sub-agents with their own context windows.

Anthropic’s Claude Code does this. The main agent maintains a high-level plan with clean context. Sub-agents handle focused tasks (file search, code analysis, test execution). Each sub-agent might use 50K tokens exploring deeply, but returns only a 2K token summary.

If a sub-agent dies from context overflow, the orchestrator notices the failure and adapts. It can spawn a new sub-agent with different instructions: “search only the src/ directory” instead of “search the entire codebase.”

Sub-Agent Benefits

  • Isolation: Sub-agent death doesn’t kill the orchestrator
  • Recovery: Failed tasks can be retried with adjusted parameters
  • Compression: 50K tokens of exploration becomes 2K of summary
  • Specialization: Each agent optimized for its specific task
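A sketch of the orchestrator side of this pattern, assuming a hypothetical `RunSubAgent` runner and `ContextOverflowError` provided by the harness (both names are illustrative, not a real framework API):

```typescript
// Hypothetical harness primitives: a sub-agent runner that executes a task
// in its own context window and throws when that window overflows.
class ContextOverflowError extends Error {}

type SubAgentTask = { instruction: string; scope: string };
type RunSubAgent = (task: SubAgentTask) => string; // returns a short summary

// Try the broadest scope first; when a sub-agent dies of overflow, discard
// its bloated context and retry with a narrower scope. The orchestrator's
// own context only ever receives the summary string.
function delegate(
  run: RunSubAgent,
  task: SubAgentTask,
  fallbackScopes: string[]
): string {
  for (const scope of [task.scope, ...fallbackScopes]) {
    try {
      return run({ ...task, scope });
    } catch (err) {
      if (!(err instanceof ContextOverflowError)) throw err;
      // Sub-agent death is contained here; the loop retries, narrower.
    }
  }
  throw new Error("all scopes exhausted");
}
```

This is the "search only the src/ directory" recovery path from above, made concrete: `delegate(run, { instruction, scope: "entire codebase" }, ["src/"])`.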

Code Mode: Filtering Data Before It Enters Context

Sub-agents isolate risk. But there’s another approach: never let large results enter context at all.

Code mode lets agents write code that fetches data, chains tool calls, and filters output before it enters context. See how MCP code mode keeps data out of agent context.

Cloudflare’s Code Mode architecture (February 2026) pioneered this approach, collapsing 2,500+ API endpoints into two tools consuming roughly 1,000 tokens. That’s down from 1.17 million tokens for a traditional MCP server. Anthropic’s Claude Code implements a similar pattern.

Traditional approach: call a tool, receive the response directly into context, process the response. If the response is 150K tokens, it’s now in your context whether you need it or not.

Code mode approach: write code that fetches, processes, and extracts only what you need. The difference is dramatic. Cloudflare’s benchmarks show 32% token savings for simple tasks and 81% for complex batch operations, with some scenarios hitting 99.9% reduction when agents filter data in a code execution environment instead of receiving it directly.

Consider an agent working with Google Drive spreadsheets and Salesforce records:

// Data filtering: agent sees 5 rows, not 10,000
const allRows = await gdrive.getSheet({ sheetId: 'abc123' });
const pendingOrders = allRows.filter(row => row["Status"] === 'pending');
console.log(pendingOrders.slice(0, 5));  // ~200 tokens instead of 40K

// Tool chaining: data flows between tools without entering context
const transcript = (await gdrive.getDocument({ documentId: 'notes' })).content;
await salesforce.updateRecord({
  objectType: 'SalesMeeting',
  recordId: '00Q5f000001abcXYZ',
  data: { Notes: transcript }  // 15K tokens never seen by the model
});

The agent is programming its own data pipeline, not passively receiving tool outputs. It can filter 10,000 rows to the 5 that matter, chain tool calls where data flows directly between tools, and run analysis in the execution environment instead of loading everything into context.

File-Based Iteration: Using the Filesystem as a Context Buffer

A related pattern uses the filesystem as a buffer. Instead of loading everything into context, dump results to files and use bash to iterate through them.

# Dump search results to a file
search_logs "authentication" > /tmp/results.json

# Agent can now selectively read what it needs
head -100 /tmp/results.json        # First 100 lines
jq '.items[0:5]' /tmp/results.json  # First 5 items
grep "error" /tmp/results.json | head -20  # Filtered subset

The full results live on disk. The agent iteratively explores with bash commands, each returning small slices. If it needs more, it reads more. If it doesn’t, those 50K tokens never touched context.

Memory Pointers for Token Reduction

Code mode writes to files. But what about data that needs to persist across many operations?

Anthropic’s context editing approach achieved 84% token reduction in a 100-turn web search evaluation by storing tool outputs externally and passing lightweight references. Instead of passing raw data through context, the agent references external storage by ID.

The agent sees: [memory:doc_12345] instead of the 10K token document. When it needs the content, it retrieves it. When it doesn’t, the pointer costs 20 tokens instead of 10,000.

// Without memory pointers: 50 documents x 10K tokens = 500K tokens
{
  "documents": [
    { "id": "doc_001", "content": "Full 10,000 token document..." },
    { "id": "doc_002", "content": "Another 10,000 token document..." },
    // ... 48 more full documents
  ],
  "total_tokens": 500000  // Impossible to fit in any context
}

// With memory pointers: 50 pointers x 20 tokens = 1K tokens
{
  "document_refs": [
    "[memory:doc_001]", "[memory:doc_002]", "[memory:doc_003]",
    // ... 47 more pointers
  ],
  "current_document": { "id": "doc_001", "content": "Full content..." },
  "total_tokens": 11000  // Only the active document + pointers
}

This works particularly well for repetitive operations. An agent processing 50 similar documents doesn’t need all 50 in context simultaneously. It needs the current document and references to the others.
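The mechanics reduce to a small lookup table. A minimal sketch, using the article's `[memory:<id>]` pointer format (the class itself is illustrative, not a real library):

```typescript
// Minimal pointer store: context carries "[memory:<id>]" strings while full
// documents live outside the context window.
class MemoryStore {
  private docs = new Map<string, string>();

  // Store a document and hand back a lightweight pointer for the context.
  store(id: string, content: string): string {
    this.docs.set(id, content);
    return `[memory:${id}]`; // ~20-token pointer replaces the full document
  }

  // Pay the full token cost only when the agent actually needs the content.
  resolve(pointer: string): string {
    const id = pointer.slice("[memory:".length, -1);
    const content = this.docs.get(id);
    if (content === undefined) throw new Error(`unknown pointer: ${pointer}`);
    return content;
  }
}
```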

Progressive Context Compaction

The techniques above handle tool response overflow. But conversations also accumulate tokens turn by turn, even without large tool responses. What about gradual bloat from conversation history itself?

Compaction summarizes conversation history to reclaim context space. Anthropic describes this as distilling context contents in a high-fidelity manner while preserving architectural decisions and implementation details.

The limitation: compaction only helps with gradual degradation. If an agent takes a single action that returns 250K tokens, compaction doesn’t save it. The death is instant.

Compaction works best when combined with other techniques. Use sub-agents to prevent one-shot kills. Use compaction to handle conversation accumulation over time.

Compaction Tradeoffs

  • Prevents: Gradual context bloat from long conversations
  • Doesn’t prevent: One-shot kills from massive tool responses
  • Risk: Overly aggressive summarization loses subtle context
  • Best for: Multi-turn tasks with clear milestones
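The core move is simple: keep the last few turns verbatim, collapse everything older into one summary turn. A sketch, where `summarize` stands in for an LLM call and the `Turn` shape is illustrative:

```typescript
type Turn = { role: string; content: string };

// Collapse all but the last `keepRecent` turns into a single summary turn.
function compact(
  history: Turn[],
  keepRecent: number,
  summarize: (turns: Turn[]) => string
): Turn[] {
  if (history.length <= keepRecent) return history; // nothing to reclaim yet
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  return [
    {
      role: "system",
      content: `Summary of ${older.length} earlier turns: ${summarize(older)}`,
    },
    ...recent, // recent turns survive verbatim
  ];
}
```

The quality of the whole technique lives inside `summarize`: this is where architectural decisions and implementation details must survive the distillation.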

Recursive Language Model (RLM)

Sub-agents work. But what if you need to process more data than any single agent can handle?

RLM (Recursive Language Model), developed by Zhang and Khattab at MIT CSAIL (2025), takes a different approach: store context in a Python REPL environment and let the model write code to interact with it.

Instead of loading data into the context window, RLM loads data into REPL variables. The model then writes Python code to peek at subsets, run regex queries, partition into chunks, and summarize results. Data lives in memory, not in context.

# Context stored in REPL variable, not model context
context = load_document("large_codebase.tar.gz")  # 500K tokens

# Model writes code to explore without loading everything
preview = context.peek(lines=100)  # See structure first
matches = context.search(r"def authenticate")  # Regex query
chunks = context.partition(chunk_size=5000)  # Split for processing

# Recursive call: spawn sub-instances with smaller contexts
results = []
for chunk in chunks:
    summary = rlm.call(query="Find security issues", context=chunk)
    results.append(summary)  # Only summaries enter main context

The REPL environment manages control flow. The model decides decomposition strategies at inference time: how to chunk, what to search for, when to spawn recursive calls. Each recursive call gets its own isolated context, processes a piece, and returns a summary.

This works for tasks that would otherwise be impossible: analyzing entire codebases, processing thousands of documents, running comprehensive security audits. The main agent’s context grows slowly because it only sees summaries, not raw data.

RLM Pattern

  • Storage: Data in REPL variables, not context window
  • Interaction: Model writes Python to peek, search, partition
  • Recursion: Spawns isolated sub-instances for chunks
  • Scale: Can process inputs far beyond any context limit

MCP Tool Metadata for Context Management

The architectures above are defensive. What about giving agents the information they need to protect themselves?

MCP is already moving in this direction. The spec includes a size field for resources that hosts can use to estimate context window usage. But this only covers resources, not tool responses.

Good tool design would extend this pattern:

// Tool definition with size hints
{
  name: "search_logs",
  description: "Search application logs",
  parameters: { ... },
  responseMetadata: {
    estimatedTokens: "variable",
    averageItemSize: 500,      // tokens per log entry
    defaultLimit: 100,         // items returned by default
    pagination: {
      supported: true,
      recommendedPageSize: 50
    }
  }
}

With this metadata, an agent can reason: “I have 60K tokens remaining. This tool returns ~500 tokens per item with a default limit of 100. That’s 50K tokens. I should request a smaller page or use a sub-agent.”
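That reasoning can be mechanical. A sketch turning the `responseMetadata` fields above into a go / shrink / delegate decision (field names mirror the hypothetical schema; the 25% budget threshold is illustrative):

```typescript
interface ResponseMetadata {
  averageItemSize: number; // tokens per item
  defaultLimit: number;    // items returned by default
}

type Decision =
  | { action: "proceed" }
  | { action: "reduce"; suggestedLimit: number }
  | { action: "delegate" };

function planCall(meta: ResponseMetadata, remainingTokens: number): Decision {
  const budget = remainingTokens * 0.25; // spend at most a quarter per call
  const estimate = meta.averageItemSize * meta.defaultLimit;
  if (estimate <= budget) return { action: "proceed" };
  const suggestedLimit = Math.floor(budget / meta.averageItemSize);
  if (suggestedLimit >= 1) return { action: "reduce", suggestedLimit };
  return { action: "delegate" }; // even one item would blow the budget
}
```

With 60K tokens remaining and the log-search tool above, this yields `{ action: "reduce", suggestedLimit: 30 }`: request a smaller page instead of gambling on the 50K default.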

Even better: a dry run mode where the agent can ask “how big would this response be?” before committing:

// Dry run returns size estimate without fetching data
const estimate = await search.dryRun({ query: "auth errors", days: 30 });
// { estimatedTokens: 180000, itemCount: 3600 }

// Agent can now make an informed decision
if (estimate.estimatedTokens > remainingContext * 0.5) {
  // Delegate to sub-agent or adjust parameters
}

If you’re building tools and you give agents this information, their survival becomes their responsibility. They have the budget awareness, they have the size estimates, they can decide whether to proceed or delegate. Tools that surface their resource implications let agents make informed choices instead of blind ones.

Building a Context Engineering Strategy for AI Agents

The techniques above aren’t mutually exclusive. Most production agent systems combine several:

  • Sub-agents handle anything with unpredictable output (file reads, searches, API calls)
  • Code mode and file-based iteration keep large results out of context entirely
  • Compaction manages gradual bloat in long conversations
  • RLM patterns tackle tasks that exceed any single context window
  • Tool metadata lets agents make informed decisions about risky calls
  • Built-in filters let tools return exactly what’s needed without harness changes

The common thread: assume your agent will encounter context-killing situations, and design systems that recover gracefully when it happens.

What MCP Tool Providers Can Do for Context Engineering

Most survival architectures live in the agent harness. But tool providers aren’t helpless. There’s a spectrum of how much we can help.

Discovery at the server level. Instead of exposing 500 tools upfront, build discovery into the MCP server itself. At StackOne, we built tool search and execute capabilities that let agents discover available actions without loading every schema. Our hybrid BM25/TF-IDF search scores relevance across 10,000+ actions so agents load only what they need. A good harness can also solve this (Anthropic’s Claude SDK does dynamic tool loading), but embedding discovery in the server means it works regardless of how sophisticated the harness is.
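The shape of server-side discovery is easy to sketch. Here a toy term-overlap score stands in for the hybrid BM25/TF-IDF ranking described above (this is not StackOne's implementation, just the pattern):

```typescript
interface ToolDef { name: string; description: string }

// Rank tool definitions against a query and return only the top-k schemas,
// so the agent loads k tool definitions instead of all 500.
function discoverTools(tools: ToolDef[], query: string, k: number): ToolDef[] {
  const terms = query.toLowerCase().split(/\s+/);
  return tools
    .map((tool) => {
      const text = `${tool.name} ${tool.description}`.toLowerCase();
      return { tool, score: terms.filter((t) => text.includes(t)).length };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.tool);
}
```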

Code mode is powerful but heavy. Letting agents generate code to filter and transform data works well. But it requires infrastructure: a secure sandbox to execute arbitrary code, proper isolation, timeout handling. That’s a lot to ask from every tool provider. Most won’t build it.

The middle ground: built-in filters. What if tools offered filter parameters using syntax LLMs already know? Think JSONPath or SQLite-style queries:

// Instead of returning 50K tokens of Notion pages...
notion.search({
  query: "Q4 planning",
  filter: "$.results[?(@.properties.status == 'In Progress')]",  // JSONPath
  select: ["id", "title", "last_edited"]  // Only these fields
})

// Or SQL-like for structured data
logs.query({
  where: "level = 'error' AND timestamp > '2026-01-01'",
  limit: 50,
  select: ["message", "stack_trace"]
})

This isn’t full code mode. No sandbox needed. The syntax is well-known to LLMs (they’ve seen millions of JSONPath and SQL examples). It doesn’t require generating and executing scripts, only a filtering parameter on existing tools. But it achieves the same goal: agents request exactly what they need instead of receiving everything and drowning.

The filtering happens server-side. The agent gets 2K tokens instead of 50K. No harness changes required.
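On the server, that filtering is a few lines. A sketch applying `where`/`select`/`limit` before the response is serialized (the option names echo the examples above; this is a pattern, not a specific API):

```typescript
type Row = Record<string, unknown>;

interface FilterOptions {
  where?: (row: Row) => boolean; // a predicate compiled from the agent's query
  select?: string[];             // fields to keep
  limit?: number;                // max rows to return
}

// Filter and project rows server-side so the agent never receives the rest.
function filterResponse(rows: Row[], opts: FilterOptions): Row[] {
  let out = opts.where ? rows.filter(opts.where) : rows;
  if (opts.limit !== undefined) out = out.slice(0, opts.limit);
  if (opts.select !== undefined) {
    const fields = opts.select;
    out = out.map((row) => Object.fromEntries(fields.map((f) => [f, row[f]])));
  }
  return out;
}
```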

StackOne’s AI action SDK for agent integrations implements this pattern across 200+ connectors. Agents query a single search tool to discover relevant actions, then execute with built-in pagination and field selection. No tool schema bloat, no context overflow.

Key Takeaways on Agentic Context Engineering

  • AI agents destroy their own context windows through six common patterns — and most don’t detect the failure until it’s too late
  • Sub-agent isolation is the strongest defense: failed sub-agents don’t kill the orchestrator, and 50K tokens of exploration compresses to a 2K summary
  • Code mode reduces context usage by 32-99.9% by filtering data before it enters the context window
  • Memory pointers replace raw documents with lightweight references, achieving 84% token reduction in Anthropic’s evaluations
  • MCP tool providers can help by adding response size metadata, dry-run modes, and built-in filters using JSONPath or SQL-like syntax
  • No single technique works alone — production systems combine sub-agents, code mode, compaction, and tool metadata for resilience

The Future of Agentic Context Engineering

The architectures above work. Sub-agents, code mode, RLM, memory pointers, tool metadata, built-in filters. They’re all effective. But they’re also fragmented. Every harness implements its own version. Every tool provider makes different choices.

SDKs and agent frameworks still lack most of this functionality, and that makes sense. The space moves fast. Commit to one approach today, and it might be obsolete in three months. Teams are cautious about building infrastructure that could become technical debt.

At StackOne, we build context management natively into our agent connectors on the same principle: assume agents will kill themselves, and design systems that recover anyway. The harnesses that get this right become the platforms, not because of features, but because their agents actually finish the task.

Frequently Asked Questions

What is agentic context engineering?
Agentic context engineering is the practice of managing what information an AI agent sees at each step of execution — system prompts, tool definitions, conversation history, and tool responses — so the agent completes multi-step tasks within its token budget. Unlike prompt engineering, which focuses on crafting the initial instruction, context engineering manages the entire information lifecycle.
Why do AI agents run out of context?
AI agents exceed context limits through six common patterns: RAG pre-loading that consumes 60%+ of tokens before the user speaks, tool definition bloat from hundreds of MCP schemas, greedy file reads that accumulate unchecked, conversation bloat from multi-turn history, context poisoning from hallucinated state, and intermediate result accumulation from sequential tool calls.
What is context suicide in AI agents?
Context suicide is when an AI agent takes a perfectly reasonable action — like reading a file or calling an API — that generates a response large enough to exceed its context window. The agent doesn't crash or throw an exception; the request fails, and the agent never understands why.
How do sub-agents prevent context overflow?
Sub-agents run risky operations in their own isolated context windows. If a sub-agent dies from context overflow, the orchestrator survives and can spawn a new sub-agent with adjusted parameters. A sub-agent might use 50K tokens exploring deeply but returns only a 2K token summary to the orchestrator.
What can MCP tool providers do about context overflow?
MCP tool providers can help by building discovery into servers (so agents don't load all tool schemas upfront), adding response size metadata and dry-run modes, and offering built-in filter parameters using JSONPath or SQL-like syntax so agents request only the data they need.
