This is an English translation of a Japanese blog. Some content may not be fully translated.
Measuring and Improving Context Usage of Document MCP Servers

Introduction

MCP servers provide external tools to LLM agents, but all tool responses go into the LLM’s context window. This means the amount of data an MCP tool returns directly translates to context consumption.

This article documents the process and results of measuring the context usage of my custom Snowflake Docs MCP server, identifying issues, and implementing improvements.

The Relationship Between MCP Tool Responses and Context

The results of MCP tool calls are loaded into the context window as input to the LLM.

[User message] + [System prompt] + [MCP tool results] + [Past conversation] = Context consumption

Document retrieval MCP tools can return thousands to tens of thousands of characters per call. Without careful design, just a few calls can consume most of the context, making it difficult to continue the conversation.
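
To get a feel for how quickly this adds up, a rough budget calculation can help. The sketch below assumes about 4 characters per token (a common rule of thumb for English text, not a measured value) and a hypothetical 200,000-token context window. Both numbers and the function names are illustrative assumptions; the two response sizes are the ones measured later in this article.

```python
def estimate_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return round(chars / chars_per_token)


def calls_until_full(response_chars: int, context_tokens: int, reserved_tokens: int = 0) -> int:
    """How many responses of this size fit in the remaining context budget."""
    per_call = max(1, estimate_tokens(response_chars))
    return max(0, context_tokens - reserved_tokens) // per_call


# Hypothetical window size; substitute the real limit of the model in use.
CONTEXT_TOKENS = 200_000

small = calls_until_full(8_200, CONTEXT_TOKENS)
large = calls_until_full(102_079, CONTEXT_TOKENS)
print(f"~8,200-char responses that fit: {small}")
print(f"~102,079-char responses that fit: {large}")
```

Under these assumptions, an unlimited page retrieval leaves room for only a handful of calls even before the system prompt and conversation history are counted.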

When Using Subagents

When MCP tools are used via Claude Code’s subagent (Task tool), the raw tool responses stay within the subagent’s context, and only a summary is returned to the main context. This can dramatically reduce main context consumption.

| Method | Main context consumption |
| --- | --- |
| Direct call | Full text of tool results |
| Via subagent | Summary only (50-90% reduction) |

However, since subagents increase total token consumption and add latency, routing everything through subagents is not practical. It is important to control response sizes appropriately on the MCP server side.

State Before Improvement

Before improvement, the server provided two tools.

| Tool | Function |
| --- | --- |
| search_snowflake_docs | Keyword search (200-character excerpts, max 20 results) |
| get_doc_content | Full page retrieval (HTML to Markdown conversion) |

Measurement Results

Measurements were taken using the CREATE TABLE page (one of the largest pages in Snowflake documentation).

search_snowflake_docs:

| max_results | Response chars | Estimated tokens |
| --- | --- | --- |
| 3 | ~850 | ~250 |
| 5 (default) | ~1,400 | ~400 |
| 10 | ~2,800 | ~800 |
| 20 (max) | ~5,500 | ~1,500 |

The search tool is lightweight and not problematic.

get_doc_content:

| max_length | Response chars | Estimated tokens | Notes |
| --- | --- | --- | --- |
| 2,000 | ~2,200 | ~600 | |
| 4,000 | ~4,200 | ~1,100 | |
| 8,000 (default) | ~8,200 | ~2,200 | |
| 0 (unlimited) | 102,079 | ~25,000+ | Exceeds context limit |

Issues Identified

1. Text Duplication Bug

The HTML to Markdown conversion used a descendants iterator, causing the same text to appear twice for <p> tags inside <li> elements.

- Requires a value (NOT NULL).
Requires a value (NOT NULL).          ← duplicate

- Has a default value.
Has a default value.                  ← duplicate

A list of 10 items expanded to 20 lines, with 30-40% of content wasted on unnecessary duplication.
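
The mechanism behind this kind of duplication can be reproduced in a few lines. The sketch below is not the server's actual converter: it uses the standard library's xml.etree.ElementTree as a stand-in for the HTML parser, and the function names are hypothetical. It shows how visiting every descendant and emitting each element's full text outputs a nested <p> once through its parent <li> and once on its own.

```python
import xml.etree.ElementTree as ET

HTML = (
    "<ul>"
    "<li><p>Requires a value (NOT NULL).</p></li>"
    "<li><p>Has a default value.</p></li>"
    "</ul>"
)


def naive_convert(root: ET.Element) -> list:
    """Buggy: every descendant emits its full text, so nested tags repeat."""
    lines = []
    for el in root.iter():  # walks <ul>, <li>, <p>, <li>, <p> in document order
        text = "".join(el.itertext()).strip()
        if el.tag == "li":
            lines.append(f"- {text}")
        elif el.tag == "p":
            lines.append(text)  # same text the parent <li> already emitted
    return lines


def fixed_convert(root: ET.Element) -> list:
    """Emit each list item once; its text already includes the nested <p>."""
    return [f"- {''.join(li.itertext()).strip()}" for li in root.iter("li")]


root = ET.fromstring(HTML)
naive = naive_convert(root)
fixed = fixed_convert(root)
print("\n".join(naive))
print("---")
print("\n".join(fixed))
```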

2. Context Failure at max_length=0

Retrieving the CREATE TABLE page without limits returned 102,079 characters, which exceeded Claude Code’s token limit, so the response was offloaded to a file. Once offloaded, the content can no longer be referenced directly within the context, making the result essentially unusable.

3. No Way to Retrieve Specific Sections

The approach of truncating from the beginning of the page to max_length characters made it impossible to reach important information in the latter half of pages, such as Usage Notes or Examples.

4. No Way to Check Page Structure

To find out what sections existed, a full page retrieval with get_doc_content was required.

5. Noise in Markdown Output

After HTML to Markdown conversion, many artifacts that are of no use to the LLM remained in the output.

## Usage notes[¶](#usage-notes "Link to this heading")     ← pilcrow links (~40 chars/heading)
Copy                                                        ← remnants of copy buttons
[COPY INTO <table>](copy-into-table)                        ← unresolvable relative links

The CREATE TABLE page alone (with 24 headings) consumed approximately 960 characters in pilcrow links.

Improvement Process

Studying the AWS Official MCP Server

As a reference for improvements, I studied the implementation of the documentation MCP server officially provided by AWS. Key design points that were instructive:

  • Using the markdownify library: Rather than a hand-written HTML parser, it uses the markdownify library for HTML to Markdown conversion, which structurally prevents issues like duplication bugs
  • Pagination via start_index: Design for reading long documents in chunks using the start_index parameter
  • Safe max_length constraint: Pydantic Field with gt=0, lt=1000000 constraint to prevent 0 (unlimited) input
  • Best practices in instructions: Explicitly states “Stop reading documents over 30,000 characters once the needed information is found”

Improvement 1: HTML to Markdown Conversion Quality

Replaced the hand-written descendants loop with the markdownify library. This resolved the text duplication bug, effectively increasing the information density by approximately 1.5x for the same character count.

Additionally, implemented noise removal after Markdown conversion:

  • Remove pilcrow links ([¶](...))
  • Remove copy button remnants (Copy)
  • Remove link URLs (keep text only)
  • Remove image and SVG tags
  • Remove feedback, rating, survey, and form elements

The key is separating the removal stages. Buttons and images are removed at the HTML stage with decompose(), while pilcrow and Copy remnants are removed at the text level after Markdown conversion. <a> tags are unwrap()-ed to remove the tag while preserving the text.
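
The text-level stage can be sketched with a few regular expressions. This is a minimal illustration of the idea, not the server's actual code: the patterns and the clean_markdown name are assumptions, and the HTML-stage removal with decompose() and unwrap() is not shown. Markdown links are handled here only as a fallback for links that were not already unwrapped.

```python
import re

# "[¶](...)" anchors appended to headings (written as \u00b6 to stay ASCII-only).
PILCROW_LINK = re.compile(r"\[\u00b6\]\([^)]*\)")
# A line that contains nothing but the leftover label of a copy button.
COPY_LINE = re.compile(r"^[ \t]*Copy[ \t]*$\n?", re.MULTILINE)
# "[text](url)" becomes "text": keeps the link text, drops the URL.
MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")


def clean_markdown(markdown: str) -> str:
    """Remove text-level noise left over after HTML to Markdown conversion."""
    markdown = PILCROW_LINK.sub("", markdown)  # must run before MD_LINK
    markdown = COPY_LINE.sub("", markdown)
    markdown = MD_LINK.sub(r"\1", markdown)
    return markdown


sample = (
    '## Usage notes[\u00b6](#usage-notes "Link to this heading")\n'
    "Copy\n"
    "See [COPY INTO <table>](copy-into-table) for details.\n"
)
cleaned = clean_markdown(sample)
print(cleaned)
```

The pilcrow pattern runs first because the generic link pattern would otherwise turn the anchor into a stray pilcrow character instead of removing it.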

Improvement 2: Introducing Gradual Information Retrieval

Expanded from 2 tools to 5, enabling gradual information retrieval.

search_snowflake_docs  → Document search (with include_headings=True, can also get TOC simultaneously)
search_in_doc          → In-page keyword search (auto-selects relevant sections)
get_doc_toc            → Get page table of contents (lightweight)
get_doc_section        → Retrieve a specific section
get_doc_content        → Retrieve full page content (with pagination)

Recommended usage flow:

1. search_snowflake_docs(include_headings=True) to get search results + TOC in one call
2. search_in_doc to automatically extract only relevant sections
3. Only specify include_code_blocks=True when code examples are needed

Improvement 3: Response Size Control

Added a hard cap on max_length to prevent context failures.

| max_length value | Behavior |
| --- | --- |
| 0 | Honors full retrieval intent, returns up to hard cap (20,000) |
| 1–20,000 | Uses specified value as-is |
| 20,001+ | Capped at 20,000 |
| Negative | Falls back to default (5,000) |

Additionally introduced pagination via start_index parameter and natural boundary cutting (priority: paragraph → line → sentence → word).
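
The clamping rules and the boundary-aware cut can be sketched together as follows. This is a simplified illustration under stated assumptions, not the server's implementation: the function names, the returned next-index convention, and the exact separators used for each boundary type are hypothetical.

```python
from typing import Optional, Tuple

HARD_CAP = 20_000           # hard cap described in this section
DEFAULT_MAX_LENGTH = 5_000  # default used when the input is negative


def effective_max_length(max_length: int) -> int:
    """Map the requested max_length onto a safe value."""
    if max_length < 0:
        return DEFAULT_MAX_LENGTH
    if max_length == 0:  # "give me everything" is honored up to the cap
        return HARD_CAP
    return min(max_length, HARD_CAP)


def read_chunk(text: str, start_index: int, max_length: int) -> Tuple[str, Optional[int]]:
    """Return (chunk, next_start_index); next_start_index is None at the end."""
    limit = effective_max_length(max_length)
    end = start_index + limit
    if end >= len(text):
        return text[start_index:], None
    window = text[start_index:end]
    # Prefer the latest boundary of the highest-priority kind that exists:
    # paragraph, then line, then sentence, then word (separators are assumed).
    for separator in ("\n\n", "\n", ". ", " "):
        position = window.rfind(separator)
        if position > 0:
            cut = position + len(separator)
            return window[:cut], start_index + cut
    return window, end  # no boundary found: fall back to a hard cut


page = "First paragraph.\n\nSecond paragraph is longer. It has two sentences.\n\nThird."
first, next_index = read_chunk(page, 0, 30)
rest, done = read_chunk(page, next_index, 0)
```

A production version would likely also require the boundary to fall reasonably late in the window, so that an early paragraph break does not produce a very small chunk.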

Improvement 4: Code Blocks Off by Default

Code blocks occupy the majority of context on syntax reference pages. Measurements on the CREATE TABLE page showed that removing code blocks achieves a 37.7% context reduction.

| Mode | Size | Reduction |
| --- | --- | --- |
| Code blocks ON | 52,694 chars | - |
| Code blocks OFF | 32,851 chars | 37.7% |

Changed the default value of the include_code_blocks parameter to False, requiring explicit specification only when syntax or code examples are needed.
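
One way to implement such a switch is to drop fenced blocks from the converted Markdown unless the caller opts in. The sketch below assumes the converter emits triple-backtick fences; the render function and the regular expression are illustrative, and only the include_code_blocks parameter name comes from the server.

```python
import re

# Built from single characters so this example contains no literal fence.
FENCE_MARK = "`" * 3

# A fenced block: an opening fence line through the next closing fence line.
FENCED_BLOCK = re.compile(
    rf"^{FENCE_MARK}.*?^{FENCE_MARK}[ \t]*\n?",
    re.MULTILINE | re.DOTALL,
)


def render(markdown: str, include_code_blocks: bool = False) -> str:
    """Return the page text, without fenced code blocks unless requested."""
    if include_code_blocks:
        return markdown
    return FENCED_BLOCK.sub("", markdown)


doc = (
    "Intro\n"
    + FENCE_MARK + "sql\n"
    + "CREATE TABLE t (id INT);\n"
    + FENCE_MARK + "\n"
    + "Outro\n"
)
without_code = render(doc)
with_code = render(doc, include_code_blocks=True)
```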

Improvement 5: Reducing Tool Call Count

Added the include_headings parameter to search_snowflake_docs, enabling retrieval of heading structure simultaneously with search.

Before: search → get_doc_toc → get_doc_section  (3 calls)
After:  search(include_headings=True) → get_doc_section  (2 calls)

However, pages with many headings (DB2 migration guides with 120+ headings, etc.) caused response bloat, so a cap of 30 headings per result was added. When exceeded, it narrows down to h1-h3 only.
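
The cap can be sketched as a small filter over the heading list. The (level, title) tuple shape and the function name are assumptions for illustration, and the final truncation step is also an assumption: the article only states that headings are narrowed to h1-h3 when the cap is exceeded.

```python
from typing import List, Tuple

MAX_HEADINGS = 30       # cap per search result
FALLBACK_MAX_LEVEL = 3  # keep only h1-h3 when the cap is exceeded

Heading = Tuple[int, str]  # (level, title); hypothetical representation


def cap_headings(headings: List[Heading]) -> List[Heading]:
    """Return all headings if under the cap, otherwise only the shallow ones."""
    if len(headings) <= MAX_HEADINGS:
        return headings
    shallow = [h for h in headings if h[0] <= FALLBACK_MAX_LEVEL]
    # Assumption: truncate as a last resort if h1-h3 alone still exceed the cap.
    return shallow[:MAX_HEADINGS]


# A page with 121 headings, most of them deep h4 entries.
big_page = (
    [(1, "Guide")]
    + [(2, f"Part {i}") for i in range(10)]
    + [(4, f"Detail {i}") for i in range(110)]
)
capped = cap_headings(big_page)
```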

Improvement 6: In-Page Search with search_in_doc

Added the search_in_doc tool. Passing a URL and keyword scores all sections within the page and returns only the most relevant sections. This lets the MCP server itself handle the TOC review → section selection judgment.

A scoring accuracy issue arose: because a parent section’s get_text() includes all child section text, parent scores were always higher. Resolved by excluding child section text from parent scores and deduplicating nested results.
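
The fix can be illustrated with a small section tree in which each node stores only its own text. Everything here is a hypothetical sketch: the Section structure, the occurrence-count scoring, and the title weight of 3 are assumptions, not the server's actual scoring function.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Section:
    title: str
    own_text: str  # text directly under this heading, child sections excluded
    children: List["Section"] = field(default_factory=list)


def score(section: Section, keyword: str) -> int:
    """Count keyword hits in the section's own text; title hits weigh more."""
    kw = keyword.lower()
    return 3 * section.title.lower().count(kw) + section.own_text.lower().count(kw)


def walk(section: Section) -> Iterator[Section]:
    """Yield the section and all of its descendants, depth first."""
    yield section
    for child in section.children:
        yield from walk(child)


def top_sections(root: Section, keyword: str, limit: int = 3) -> List[str]:
    """Titles of the best-scoring sections; sections with no hits are dropped."""
    scored = [(score(s, keyword), s) for s in walk(root)]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s.title for _, s in scored[:limit]]


page = Section(
    "CREATE TABLE",
    "Creates a new table in the current schema.",
    [
        Section("Syntax", "CREATE TABLE <name> ( <col> <type> )"),
        Section("Usage notes", "A clustering key is optional. Clustering improves pruning."),
    ],
)
best = top_sections(page, "clustering")
```

Because the parent is scored on its own text only, it no longer inherits the hits of its children, and a matching child is not returned a second time as part of its parent.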

Improvement 7: Caching and Performance Improvements

  • In-memory cache with TTL to eliminate duplicate HTTP fetches of the same URL (5 minutes, LRU limit of 50 entries)
  • buildId cache (1 hour TTL)
  • Connection pooling via shared HTTP client
  • Parallel heading retrieval via asyncio.gather
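
The first item can be sketched with the standard library alone. The class below is a minimal illustration of a TTL cache with an LRU size limit, using the 5-minute and 50-entry values from the list; the class name, method names, and injectable clock are assumptions made for the example.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory cache with per-entry expiry and least-recently-used eviction."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        max_entries: int = 50,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._data: "OrderedDict[str, tuple]" = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry


page_cache = TTLCache()  # would be keyed by URL, holding the fetched HTML
```

A fetch function would call get() first and only issue the HTTP request (then set()) on a miss.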

Improvement Results

Full Page Retrieval Comparison for get_doc_content

Comparison of full page retrieval of the CREATE TABLE page using get_doc_content(include_code_blocks=False).

| Condition | Before | After | Notes |
| --- | --- | --- | --- |
| max_length=2000 | ~2,200 chars (30-40% duplicate) | 2,269 chars (no duplicates) | ~1.5x information per character |
| max_length=5000 | - | 5,351 chars | New default after improvement |
| max_length=8000 | ~8,200 chars (30-40% duplicate) | 8,450 chars (no duplicates) | Previous default |
| max_length=0 | 102,079 chars (context failure) | 20,000 or less (safe) | 95%+ reduction |

Full Page Retrieval Comparison Across Multiple Pages

Verified the same effect applies to pages other than CREATE TABLE.

| Page | Before | After | Reduction |
| --- | --- | --- | --- |
| CREATE TABLE (large SQL reference) | 70,948 chars | 44,503 chars | 37.3% |
| SELECT (large SQL reference) | 25,068 chars | 10,538 chars | 58.0% |
| Dynamic Tables (medium guide) | 5,409 chars | 4,529 chars | 16.3% |
| Total | 101,425 chars | 59,570 chars | 41.3% |

Reference pages with many links show greater improvement, achieving 58% reduction on the SELECT page. Small guide pages show relatively smaller reduction rates, but since the original size is small, the practical impact is minimal.
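
The reduction percentages in the table can be reproduced from the character counts. The short script below only restates the arithmetic; the reduction_pct helper is a name chosen for this example.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of characters removed relative to the original size."""
    return (before - after) / before * 100


# (before, after) character counts from the multi-page comparison.
pages = {
    "CREATE TABLE": (70_948, 44_503),
    "SELECT": (25_068, 10_538),
    "Dynamic Tables": (5_409, 4_529),
}

total_before = sum(before for before, _ in pages.values())
total_after = sum(after for _, after in pages.values())

for name, (before, after) in pages.items():
    print(f"{name}: {reduction_pct(before, after):.1f}%")
print(f"Total: {reduction_pct(total_before, total_after):.1f}%")
```

Note that the 41.3% total is computed over the summed character counts, so the large CREATE TABLE page dominates it; the unweighted mean of the three per-page percentages is lower, at about 37%.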

Typical Usage Pattern Comparison

| Pattern | Before | After | Improvement |
| --- | --- | --- | --- |
| search + get_doc_content | ~9,600 chars | 7,378 chars | 23% reduction |
| search + 2x get_doc_content | ~17,800 chars | 12,714 chars | 29% reduction |
| search + TOC + 1 section | Not possible | 9,408 chars | New feature |
| search(headings) + section(code=OFF) | Not possible | 3,115 chars | Optimal flow |
| Want only Usage Notes | Not possible (unreachable) | 5,267 chars | New feature |

The most efficient flow (search(max_results=1, include_headings=True) → get_doc_section(code=OFF)) achieves approximately 58% context reduction compared to the typical pre-improvement flow.

Context Consumption of New Tools

| Tool | Consumption | Use case |
| --- | --- | --- |
| search_snowflake_docs (3 results) | ~850 chars | Search results only |
| get_doc_toc | ~600 chars | Checking page structure |
| get_doc_section (1 section) | ~528 chars | Retrieving a specific section |
| search_in_doc (3 sections) | ~2,400 chars | Auto-extracting relevant sections |
| get_doc_content (full, code=OFF) | ~44,503 chars | Entire page |

Using search_in_doc to retrieve only needed sections requires about 1/9th the context of a full get_doc_content retrieval.

Impact of include_headings

| Pattern | Estimated chars |
| --- | --- |
| search (5 results, headings=false) | ~500 chars |
| search (5 results, headings=true) | ~3,500–5,000 chars |

include_headings=true consumes approximately 7-10x more context, but saves the get_doc_toc call. Before the heading cap was introduced, pages with many headings could expand to 26,443+ characters; after introduction, it stays within 10,853 characters (59% reduction).

Key Points for Designing Document MCP Servers

Here is a summary of the insights gained from this improvement work.

  1. Set a hard cap on response size: Unlimited retrieval risks context failure
  2. Provide pagination: Enable chunked reading with start_index + max_length
  3. Enable gradual information retrieval: Search → lightweight TOC check → section retrieval (only what’s needed)
  4. Use a library for HTML to Markdown conversion: Hand-written HTML parsers are a source of bugs
  5. Document best practices in instructions: Provide guidance from the server side on how the LLM should use the tools
  6. Reduce tool call count: Provide options to consolidate multiple tool calls into one
  7. Exclude code blocks by default: Code blocks occupy 30-40% of responses on syntax reference pages
  8. Cache to eliminate duplicate HTTP fetches: Consecutive access to the same URL is common
  9. Remove noise from Markdown output: Remove pilcrow links, copy buttons, link URLs, image tags, etc.
  10. Add safety limits to parameters returning large amounts of data: Set caps like heading caps

Summary

The context usage of my custom Snowflake Docs MCP server was measured and the following improvements were made.

| Improvement | Effect |
| --- | --- |
| HTML to Markdown conversion quality | Eliminated duplication, noise removal: average 41.3% reduction in full retrieval |
| Gradual information retrieval | Retrieve only needed sections: about 1/9th of full page |
| Code blocks off by default | 37.7% reduction |
| Reduced tool call count | 3 steps → 2 steps |
| Response size control | 102,079 chars → 20,000 or less (prevents failure) |
| Optimal flow overall | ~58% context reduction |

When designing MCP servers, it is important to be aware that response size equals LLM context consumption, and to design with the principle of “returning only the necessary information, in the necessary amount, without noise.”
