

Guillaume Lebedel · 12 min
MCP Token Optimization: 4 Approaches to Fix Context Window Bloat



MCP token optimization addresses two distinct costs that consume an AI agent’s context window: schema bloat and response bloat.

Schema bloat is the token cost of loading tool definitions into context. GitHub’s official MCP server consumes 17,600 tokens of tool definitions per request. Connect multiple servers and you reach 30,000+ tokens of metadata before the agent does any work.

Response bloat is the token cost of tool outputs flowing back through context. A single HRIS API call can return hundreds of thousands of characters of raw JSON. Response bloat often consumes more context than schema bloat but receives less attention.

Most optimization approaches address only one side. This post compares four approaches, what each solves, and the sourced benchmarks behind them.

MCP Schema Compression: Reduce Tool Definition Tokens

MCP schema compression reduces the token cost of tool definitions by stripping descriptions, enums, and nested type documentation while preserving the parameter structure agents need to call the tool.

Atlassian’s mcp-compressor is an open-source proxy that implements this approach. It sits between your MCP servers and the LLM, offering four compression levels that trade description richness for token savings.

GitHub MCP server (94 tools) at each level:

| Level | What's kept | Tokens | Reduction |
|---|---|---|---|
| None | Full descriptions, enums, types | 17,600 | |
| Low | Full descriptions preserved | ~3,900 | 78% |
| Medium (default) | First sentence of each description | ~3,300 | 81% |
| High | Tool names + parameter names only | ~2,200 | 88% |
| Max | Only a `list_tools()` function | ~500 | 97% |

Source: Atlassian, benchmarked on GitHub’s official MCP server.
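To make the medium level concrete, here is a minimal sketch of that compression step, assuming an MCP-style tool object with `name`, `description`, and a JSON Schema `inputSchema`. The function name and exact behavior are illustrative, not mcp-compressor's actual API:

```javascript
// Sketch of "medium"-level schema compression: keep only the first
// sentence of each description and drop enum value lists.
// (Illustrative; the real mcp-compressor proxy differs in detail.)
function compressSchema(tool) {
  const firstSentence = (text = '') => {
    const match = text.match(/^[^.!?]*[.!?]/);
    return match ? match[0].trim() : text.trim();
  };
  const compressParams = (props = {}) => {
    const out = {};
    for (const [name, def] of Object.entries(props)) {
      const { enum: _dropped, ...rest } = def; // strip enum values
      out[name] = { ...rest, description: firstSentence(def.description) };
    }
    return out;
  };
  return {
    name: tool.name,
    description: firstSentence(tool.description),
    inputSchema: {
      ...tool.inputSchema,
      properties: compressParams(tool.inputSchema?.properties),
    },
  };
}

// Example: a multi-sentence description collapses to its first sentence.
const compact = compressSchema({
  name: 'create_issue',
  description: 'Create a new issue. Supports labels, assignees, and milestones.',
  inputSchema: {
    type: 'object',
    properties: {
      state: { type: 'string', enum: ['open', 'closed'], description: 'Issue state. Defaults to open.' },
    },
  },
});
// compact.description → 'Create a new issue.'
```

Serializing the compressed schema instead of the original is where the token savings come from; the parameter structure the model needs to form a valid call is untouched.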

The tradeoff is accuracy. At high compression, a tool called create_jira_issue has no description. Only parameter names remain. If another tool called create_confluence_page shares the same parameter names (title, description, project), the model has to guess which one creates a Jira task and which creates a Confluence wiki page.

| | Solves | Doesn't solve |
|---|---|---|
| Schema compression | Schema bloat. Drop-in proxy, no changes to servers or agent logic. | Response bloat. And over-compression hurts tool selection accuracy. |

When to use schema compression: Teams running multiple MCP servers who need low-effort token reduction. Start at medium compression, benchmark tool selection accuracy on real tasks, then tighten where accuracy holds.

MCP Tool Search: On-Demand Schema Discovery

Search-first tool discovery replaces upfront schema loading with on-demand retrieval. The agent starts with 2-3 lightweight meta-tools (search and execute) instead of the full tool catalog. When it needs a capability, it searches by natural language, retrieves only the matching tool definitions, and executes. The full catalog never enters context.

Implementations vary, but the architecture is the same:

  • Claude Code Tool Search — activates automatically when tool descriptions exceed 10% of context. Replaces the schema dump with a searchable registry. (Anthropic)
  • Speakeasy Dynamic Toolsets — three meta-tools (search_tools, describe_tools, execute_tool). The agent discovers capabilities conversationally.
  • StackOne Tool Search — searchTools() uses semantic search (with local BM25+TF-IDF fallback) across 200+ connectors and 10,000+ actions. The agent sees exactly 2 tools (tool_search + tool_execute) regardless of catalog size.
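The shared architecture of these implementations fits in a few lines. Below is a sketch of the two meta-tools, with naive keyword-overlap scoring standing in for the semantic/BM25 search a production system would use; the names `registerTool`, `searchTools`, and `executeTool` are illustrative, not any vendor's actual API:

```javascript
// Search-first discovery sketch: the agent's context contains only
// these two meta-tools; the full catalog lives outside it.
const catalog = new Map(); // name -> { description, handler }

function registerTool(name, description, handler) {
  catalog.set(name, { description, handler });
}

// Meta-tool 1: find matching tools by natural-language query.
// Keyword overlap is a stand-in for semantic / BM25 ranking.
function searchTools(query, topK = 3) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  return [...catalog.entries()]
    .map(([name, t]) => {
      const haystack = `${name} ${t.description}`.toLowerCase();
      const score = terms.filter((w) => haystack.includes(w)).length;
      return { name, description: t.description, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Meta-tool 2: execute a discovered tool by name.
function executeTool(name, args) {
  const tool = catalog.get(name);
  if (!tool) throw new Error(`unknown tool: ${name}`);
  return tool.handler(args);
}
```

Only the top-k search hits (name plus description) are serialized back into context, so the context cost per task scales with the query, not with the catalog.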

The numbers from Speakeasy’s benchmarks:

| Scenario | Static toolset | Dynamic toolset | Reduction |
|---|---|---|---|
| Simple task (400 tools) | Over 400,000 tokens | ~6,000 tokens | 98.5% |
| Complex task (400 tools) | Over 400,000 tokens | ~35,000 tokens | 91.2% |
| 200 tools loaded statically | 261,700 tokens (exceeds Claude's 200K context window) | | |

Source: Speakeasy published benchmarks.

Accuracy improves too. With fewer irrelevant tools in context, models pick the right tool more often:

| Model | Without Tool Search | With Tool Search | Gain | Source |
|---|---|---|---|---|
| Opus 4 | 49% | 74% | +25 percentage points | Anthropic |
| Opus 4.5 | 79.5% | 88.1% | +8.6 percentage points | Anthropic |

| | Solves | Doesn't solve |
|---|---|---|
| Search-first discovery | Schema bloat. Context stays constant regardless of catalog size. Better tool selection accuracy. | Response bloat. Once a tool executes, the full output flows back. Also adds latency (~50% more round trips per tool call). |

When to use search-first discovery: Catalogs with 50+ tools where loading all definitions upfront is impractical. StackOne uses this pattern in production across 200+ connectors and 10,000+ actions.

MCP Response Filtering: Reduce Tool Output Tokens

MCP response filtering reduces the token cost of tool outputs by requesting only the fields the agent needs, rather than returning full API payloads. Where schema compression and search-first discovery address input tokens, response filtering addresses output tokens.

Consider an HRIS list_employees call that returns 50 fields per record — name, email, SSN, benefits, compensation history, emergency contacts, manager chain. The agent needs 5 fields. The other 45 consume context without contributing to the task.

Field selection trims the response before it hits the agent:

| Illustrative example | Approximate size |
|---|---|
| Full API response (50 fields × 20 records) | Hundreds of thousands of characters |
| Filtered response (5 fields × 20 records) | ~8,000 characters |
| Reduction | ~95%+ per call |

Note: Exact sizes vary by provider and record count. The pattern is consistent — unfiltered HRIS/CRM responses are an order of magnitude larger than what the agent needs.

MCP servers that support field selection — like GraphQL-based implementations or servers built with response shaping in mind — can apply this filtering server-side before the response reaches the agent.
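The server-side trimming itself is straightforward. Here is a minimal sketch, assuming the server accepts a list of requested field names per call (the `fields` parameter and function name are illustrative; whether and how a given MCP server exposes field selection varies):

```javascript
// Server-side field selection sketch: trim each record to the
// requested fields before the response enters the agent's context.
function filterResponse(records, fields) {
  return records.map((record) =>
    Object.fromEntries(
      fields.filter((f) => f in record).map((f) => [f, record[f]])
    )
  );
}

// Example: 50-field HRIS records reduced to the 5 fields the task needs.
const full = [
  { id: 'e1', name: 'Ada', email: 'ada@example.com', ssn: '***', dept: 'Eng',
    manager: 'e9', compensation: { base: 1 }, emergency_contact: '...' },
];
const slim = filterResponse(full, ['name', 'email', 'dept', 'manager', 'id']);
// slim[0] has only the five requested fields; ssn, compensation, etc. never leave the server.
```

Applying this before serialization is what makes it effective: the dropped fields are never stringified, so they cost zero output tokens.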

But filtering doesn’t solve accumulation. Even with 5 fields per call, tokens add up across a multi-step workflow:

  • Step 1: ~2K tokens
  • Step 2: ~2K tokens
  • Step 3: ~3K tokens
  • After 5 steps: 12-15K tokens of tool responses in context

This triggers what Stanford researchers call the “lost in the middle” problem — LLM performance degrades by over 20% when relevant information sits in the middle of long context rather than at the beginning or end. Responses from earlier steps may be effectively ignored by the time the agent reaches step 5.

| | Solves | Doesn't solve |
|---|---|---|
| Response filtering | Response bloat on a per-call basis. Essential hygiene for any MCP server returning raw API payloads. | Accumulation across multi-step workflows. Also requires the MCP server to support field selection — many don't. |

When to use response filtering: Any integration where tool responses contain more data than the agent needs. Particularly important for HRIS, CRM, and ATS integrations where API responses include dozens of fields per record.

MCP Code Execution: Eliminate Schema and Response Bloat

Code-based execution (also called Code Mode) replaces MCP tool schemas and raw API responses with a single execute_code tool. It is the only approach that addresses both schema bloat and response bloat in one architectural move.

How it works. Instead of loading tool schemas into the agent’s context, a gateway gives the agent a sandboxed code execution environment running in a V8 isolate (for example, on Cloudflare Workers). The agent discovers available APIs, writes code against them, and only the filtered result comes back into context.

The agent does not write code from scratch: the gateway exposes API definitions that the agent inspects before writing code.

[Figure: Code Mode execution flow: discover available functions, inspect schemas, write code, execute in sandbox, return filtered result]

At StackOne, Code Mode implements this as a gateway in front of any remote MCP server. The agent uses execute_code to run JavaScript in a V8 isolate, search_tools to discover available functions at runtime, and .__schema() to inspect input parameters before calling. Credentials are encrypted with AES-256-GCM and injected at the transport layer so the LLM never sees them. It works with any MCP-compatible client: Claude Code, Cursor, VS Code, Windsurf, Gemini CLI, and others.

Anthropic and Cloudflare have published similar patterns. Anthropic’s sandbox exposes TypeScript function signatures the agent reads before writing code. Cloudflare’s exposes a pre-resolved OpenAPI spec the agent navigates with a search() tool. The architectural principle is the same across all three.

```javascript
// StackOne Code Mode — the agent writes this after discovering
// the HRIS API via search_tools and inspecting via .__schema().
// It runs in a V8 sandbox at the edge, not in the agent's context.
employees
  .filter({ dept: 'Engineering' })
  .sortBy('perf_score', 'desc')
  .pick(['name', 'perf_score', 'manager'])
  .take(10)
// 10 rows come back. ~312 characters. The full employee list never enters context.
```

Sourced benchmarks:

| Implementation | Before | After | Reduction | Source |
|---|---|---|---|---|
| StackOne Code Mode | 55,000+ chars schema | 416 chars | 99.3% | StackOne |
| Anthropic (spreadsheet filtering) | 150,000 tokens | 2,000 tokens | 98.7% | Anthropic engineering blog |
| Cloudflare (2,500 API endpoints → 2 tools) | 1,170,000 tokens | ~1,000 tokens | 99.9% | Cloudflare blog |

Accuracy impact. StackOne benchmarked Code Mode against the MCP Atlas evaluation suite. The results show Code Mode significantly improves agent tool-use accuracy:

| Model | Native MCP | + StackOne Code Mode | Source |
|---|---|---|---|
| Sonnet 4.6 | 42% | 80% | StackOne (MCP Atlas benchmark) |
| Opus 4.6 | 62% | | StackOne (MCP Atlas benchmark) |

A cheaper model with Code Mode outperforms the flagship model without it. The improvement comes from freeing context for reasoning. When tool schemas and verbose responses no longer consume 90%+ of the context window, the model has more capacity to select the right tool and execute correctly.

What code execution requires:

  • A sandboxed runtime. V8 isolates, Cloudflare Workers, or containerized environments with resource limits and monitoring.
  • API discoverability. The agent needs to discover available functions and inspect their schemas within the sandbox before writing code.
  • Credential injection. Credentials must be injected at the transport layer so the agent’s code can authenticate with APIs without the LLM ever seeing secrets.
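The credential-injection requirement, in particular, is easy to sketch: the sandbox receives a pre-authenticated HTTP client, so agent-written code can call APIs without ever holding the secret. The function name `makeSandboxFetch` is illustrative, and this sketch shows only header injection; a production gateway like StackOne's additionally encrypts credentials with AES-256-GCM:

```javascript
// Transport-layer credential injection sketch: the gateway builds a
// fetch wrapper that carries the API key; only the wrapper is exposed
// inside the sandbox, never the key itself as a readable value.
function makeSandboxFetch(baseUrl, apiKey, fetchImpl = fetch) {
  return (path, options = {}) =>
    fetchImpl(new URL(path, baseUrl), {
      ...options,
      headers: { ...options.headers, Authorization: `Bearer ${apiKey}` },
    });
}

// The gateway constructs this outside the sandbox, then injects it:
const sandboxFetch = makeSandboxFetch('https://api.example.com', 'secret-key');
// Agent code calls sandboxFetch('/employees') and gets an authenticated
// request without the LLM ever seeing 'secret-key' in its context.
```

The same closure pattern generalizes to other transports; the point is that the secret is bound before the agent's code runs, not passed through the model.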

| | Solves | Doesn't solve |
|---|---|---|
| Code-based execution | Both schema bloat and response bloat. Fixed context footprint regardless of API surface. 96-99% token reduction. | Requires sandbox infrastructure. Higher setup complexity than the other three approaches. Agent must generate correct code. |

When to use code-based execution: Workflows where both schema and response bloat contribute to context pressure. Particularly effective for large API surfaces (hundreds of endpoints) and data-heavy connectors like BambooHR, Salesforce, and Workday.

MCP Token Optimization: All 4 Approaches Compared

| Approach | Schema reduction | Response reduction | Setup cost | Best for |
|---|---|---|---|---|
| Schema Compression | 70-97% | None | Low (drop-in proxy) | Quick wins on multi-server setups |
| Search-First Discovery | 91-97% | None | Low-Medium | Large catalogs (50+ tools) |
| Response Filtering | None | ~95% per call | Medium | Large API responses (HRIS, CRM) |
| Code-Based Execution | 98-99% | 97-99% | High | Both problems at scale |

How to Choose an MCP Token Optimization Strategy

When tool schemas consume too much context

  • Start with search-first discovery. Claude Code enables this automatically. StackOne’s getSearchTool() gives the agent 2 tools regardless of catalog size.
  • Add schema compression as a drop-in proxy for additional reduction.

When tool responses are too large

  • Response filtering is essential. Build field selection into your MCP server or use one that supports it.

When both schemas and responses are the problem

  • Code-based execution (Code Mode) addresses both in one architectural move. Higher setup cost, but fixed context footprint regardless of API surface.

When combining multiple approaches

  • Search-first discovery + response filtering covers most use cases.
  • Add Code Mode for the highest-volume workflows where both sides matter.

A common gap is optimizing schema bloat while leaving response bloat unaddressed. Measuring both input and output token costs before selecting an approach avoids solving the smaller problem.
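Measuring both sides can be as simple as a rough audit before you pick an approach. The sketch below uses the crude chars/4 heuristic as a token estimate (a real audit would use the model's own tokenizer, and `auditContext` is an illustrative name, not a library function):

```javascript
// Rough audit of where context goes: schema bloat vs response bloat.
// chars/4 is a crude token estimate; real tokenizers vary by model.
const estimateTokens = (value) => Math.ceil(JSON.stringify(value).length / 4);

function auditContext(toolSchemas, toolResponses) {
  const schemaTokens = toolSchemas.reduce((sum, s) => sum + estimateTokens(s), 0);
  const responseTokens = toolResponses.reduce((sum, r) => sum + estimateTokens(r), 0);
  return {
    schemaTokens,
    responseTokens,
    dominantCost: responseTokens > schemaTokens ? 'response bloat' : 'schema bloat',
  };
}

// Feed it the schemas you load at startup and a captured trace of tool
// responses from a representative task; optimize the dominant cost first.
```

Even this crude number is usually enough to tell you whether to start with schema-side optimization (compression, search-first discovery) or response-side optimization (filtering, Code Mode).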

At StackOne, we use search-first discovery across 200+ connectors and 10,000+ actions to keep schema costs constant regardless of catalog size. StackOne Code Mode goes further, eliminating both schema and response bloat through sandboxed code execution at the edge. For a deeper look at context window failure patterns, see six patterns that kill agent context windows.

Frequently Asked Questions

How many tokens do MCP tool schemas use?
MCP tool schemas typically consume 500-1,400 tokens each. GitHub's official MCP server (94 tools) uses 17,600 tokens per request for tool definitions alone. The Atlassian MCP server uses roughly 10,000 tokens for Jira and Confluence tools. Connect multiple servers and schema costs reach 30,000+ tokens before the agent does any work.
What is the difference between MCP schema bloat and response bloat?
MCP schema bloat is the token cost of loading tool definitions into context. MCP response bloat is the token cost of tool outputs flowing back. A single HRIS API call can return hundreds of thousands of characters of raw JSON. Most optimization approaches only address schema bloat. Response bloat often consumes more context but gets less attention.
What is MCP code mode?
MCP code mode replaces traditional tool schemas with a single execute_code tool. The agent writes code that runs in a sandboxed V8 isolate, calls underlying APIs, filters results, and returns only the answer. StackOne Code Mode implements this as a gateway for any remote MCP server. Anthropic documented 98.7% token reduction. Cloudflare achieved 99.9% by compressing 2,500 API endpoints into 2 tools.
How do you reduce MCP token usage?
To reduce MCP token usage, four approaches are available: (1) schema compression — strip descriptions and enums from tool definitions, 70-97% reduction; (2) search-first tool discovery — load only the tools needed per task via semantic search; (3) response filtering — request only the fields the agent needs, reducing output tokens; (4) code-based execution — the agent writes filtering code that runs at the edge, addressing both schema and response bloat.
Which MCP token optimization approach should I use?
The right MCP token optimization approach depends on your bottleneck. If you have many tools but small responses, schema compression or search-first discovery is sufficient. If tool responses are large (HRIS, CRM data), response filtering is essential. If you need both schema and response optimization, code-based execution handles both in one move. Most production systems combine 2-3 approaches.
