How to Keep MCPs Useful for Agent Pipelines

Introduction
Applications powered by large language models (LLMs) need to integrate with external services, for example with Google Calendar to set up meetings or with PostgreSQL to access specific data.
Function calling
Initially, these integrations were implemented through function calling: we created special functions that the LLM could call by generating special tokens that followed patterns we defined, and we then parsed those tokens and executed the calls. For each tool we had to implement authentication and the API call methods ourselves. Importantly, we also had to manage all the instructions that told the model how to call these tools, and build an internal understanding of the functions, including defaults and user-specific parameters. But the hype around “AI” demanded quick, sometimes brute-force solutions to keep pace, which is where MCP, introduced by Anthropic, came in.
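The parse-and-execute loop described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's actual implementation: the `<tool_call>` marker, the `create_meeting` function and the `TOOLS` registry are all hypothetical names chosen for the example.

```python
import json
import re

# Hypothetical marker pattern; real models define their own special tokens.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def create_meeting(title: str, start: str) -> str:
    """Stand-in for a real calendar API call (authentication omitted)."""
    return f"Meeting '{title}' scheduled at {start}"


# Registry that maps tool names the model may emit to Python callables.
TOOLS = {"create_meeting": create_meeting}


def run_tool_calls(llm_output: str) -> list[str]:
    """Find tool-call blocks in the model output, parse them and execute them."""
    results = []
    for payload in TOOL_CALL_RE.findall(llm_output):
        call = json.loads(payload)
        func = TOOLS[call["name"]]
        results.append(func(**call["arguments"]))
    return results


output = (
    'Sure. <tool_call>{"name": "create_meeting", '
    '"arguments": {"title": "Sync", "start": "2025-01-10T10:00"}}</tool_call>'
)
print(run_tool_calls(output))
```

Everything around this loop (credentials, per-tool instructions, defaults) had to be maintained by hand for each integration.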
MCPs
MCP stands for Model Context Protocol, and today it is the standard way to provide tools to most agent pipelines. MCPs basically handle both the integration work and the instructions the LLM needs to use the tools. Some may argue that the Skills and code execution features that Anthropic recently introduced have killed MCPs, but in fact these features often use MCPs to compile and manage instructions (Coding with MCP – Anthropic). Skills and code execution focus on the problem of context management and tooling, which is a different problem from the one MCPs address.
MCPs provide a common way to integrate different services (tools) with LLMs and provide the instructions that LLMs use to drive those tools. However, there are a few problems:
- The current Model Context Protocol assumes that all parameters of a tool call are exposed to the LLM and that all their values are generated by the LLM. For example, the LLM has to generate a user id value if the tool call requires it. This is overhead, because the system (the application) already knows the user id without the LLM generating it; on top of that, to make the LLM aware of the user id we have to put it into the prompt (FastMCP from gofastmcp has a way to “hide arguments” that addresses this problem, but I have not seen it in Anthropic's own MCP implementation).
- There is no out-of-the-box control over the instructions. MCPs provide a description for each tool and for each tool argument, and these values are used blindly in agentic pipelines as LLM API call parameters. Each description is written by the developer of that particular MCP server.
System prompt and tools
When you call an LLM, you usually pass the tools as a parameter of the API call. The value of this parameter is obtained from the MCP's list_tools function, which returns a JSON schema of the available tools.
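As a rough illustration of this hand-off, the sketch below converts tool entries shaped like an MCP tool listing (`name`, `description`, `inputSchema`) into the function-tool layout used by OpenAI-style chat APIs. The helper name `mcp_tools_to_openai` and the sample tool are assumptions made for the example; note that the descriptions are copied through unchanged.

```python
def mcp_tools_to_openai(mcp_tools: list[dict]) -> list[dict]:
    """Map MCP-style tool entries to OpenAI-style function tools."""
    empty_schema = {"type": "object", "properties": {}}
    return [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                # The server author's description is passed through as-is.
                "description": tool.get("description", ""),
                "parameters": tool.get("inputSchema", empty_schema),
            },
        }
        for tool in mcp_tools
    ]


# Hypothetical tool entry, shaped like one item of a list_tools result.
listed = [
    {
        "name": "search_listings",
        "description": "Search listings by location.",
        "inputSchema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }
]
openai_tools = mcp_tools_to_openai(listed)
print(openai_tools[0]["function"]["name"])
```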
At the same time, this “tools” parameter is used to add information to the model's system prompt. For example, the Qwen3-VL model has a chat_template that controls how tools are inserted into the system prompt, as follows:
“...You are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}...”
So the tool descriptions end up in the system prompt of the LLM you are running.
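To make the effect concrete, here is a small standard-library approximation of what such a template does. It is not the actual Qwen chat template: the function name `render_system_prompt` and the sample tool are hypothetical, and `json.dumps` only loosely mirrors Jinja's `tojson` filter.

```python
import json


def render_system_prompt(base_prompt: str, tools: list[dict]) -> str:
    """Append every tool definition, serialized as JSON, to the system prompt."""
    parts = [
        base_prompt,
        "",
        "You are provided with function signatures within <tools></tools> XML tags:",
        "<tools>",
    ]
    parts += [json.dumps(tool) for tool in tools]
    parts.append("</tools>")
    return "\n".join(parts)


# Hypothetical tool definition for illustration.
tools = [
    {
        "name": "search_listings",
        "description": "Search listings by location.",
        "parameters": {"type": "object", "properties": {"location": {"type": "string"}}},
    }
]
prompt = render_system_prompt("You are a helpful assistant.", tools)
print(prompt)
```

Whatever text the MCP author put into `description` is now part of the prompt the model reads on every call.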
The first problem is partially solved by the “hide arguments” method from FastMCP mentioned above. Nevertheless, I have seen other solutions where values like “user id” are pushed into the system prompt so that the model can use them when calling the tool; this is fast and very easy from an engineering point of view (in fact no engineering is needed to put a value into the LLM system prompt and reuse it). So here I focus on the second problem.
I also leave aside the problems related to the tons of garbage MCPs on the market: some of them do not work, others ship generated tool descriptions that can confuse the model. The problem I focus on here is inconsistent tool and parameter descriptions, which may be the reason LLMs misbehave with some tools.
In place of a conclusion to the introduction:
If your LLM-enabled pipeline fails with the tools you have, you can:
- Just choose the most powerful, modern and expensive LLM API;
- Revisit your tools and overall instructions.
Both can work. Make your decision or ask your AI assistant to make a decision for you…
The formal part of the work: research
1. Examples of different descriptions
A search through real MCPs on the market, looking at their tool lists and descriptions, turns up many examples of this problem. Here I give one example each from two MCPs in different domains (in real-life situations, the MCPs a model uses often come from different domains):
Example 1:
Tool description: “Generate an area chart to show data trends under continuous independent variables and observe the overall data trend, such as, displacement = velocity (average or instantaneous) × time: s = v × t. If the x-axis is time
“data” field description: “The data of the area chart, should be an array of objects, each object contains a `time` field and a `value` field, such as, [{ time: '2015', value: 23 }, { time: '2016', value: 32 }], when grouping is required for the area chart, the data must contain a `group` field, such as, [{ time: '2015', value: 23, group: 'A' }, { time: '2015', value: 32, group: 'B' }].”
Example 2:
Tool description: “Search Airbnb listings with various filters and pagination. Provide direct links to the user”
“location” field description: “Location to be searched (city, state, etc.)”
I am not saying that either of these descriptions is wrong; they are just very different in format and level of detail.
2. Dataset and benchmark
To show that different tool descriptions can change the behavior of the model, I used NVIDIA's “When2Call” dataset. From this dataset I took test samples that offer multiple tools to choose from and where one tool is the correct choice (according to the dataset, calling that tool is better than calling any other tool or giving a text response without a tool call). The idea of the benchmark is to count correct and incorrect tool calls, and I count “no tool call” cases as wrong answers. As the LLM I chose “gpt-5-nano” from OpenAI.
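The scoring rule can be sketched as follows. This is an illustration of the counting logic only, not the actual evaluation script; the function name `tool_call_accuracy` and the tool names in the usage lines are hypothetical, and `None` stands for a response without any tool call.

```python
from typing import Optional


def tool_call_accuracy(predicted: list[Optional[str]], expected: list[str]) -> float:
    """Share of samples where the model called exactly the expected tool.

    A prediction of None (no tool call) is counted as a wrong answer.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    correct = sum(
        1 for pred, gold in zip(predicted, expected) if pred is not None and pred == gold
    )
    return correct / len(expected)


# Hypothetical predictions: one correct, one missing call, one correct, one wrong tool.
predicted = ["get_weather", None, "search_flights", "book_hotel"]
expected = ["get_weather", "get_weather", "search_flights", "search_hotels"]
accuracy = tool_call_accuracy(predicted, expected)
print(accuracy)
```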
3. Data processing
The original dataset provides just one description per tool. To create alternative descriptions for each tool and parameter, I used “gpt-5-mini” to generate them from the existing ones, using the following prompt to make them more detailed (after generation there was an additional step to verify the output and regenerate it when necessary):
“””You will get the tool description in JSON format. Your task is to make the tool description more detailed, so that it can be used by the weak model.
One way to do this: include a detailed description of how it works and examples of how it's used.
Example of detailed descriptions:
Tool description: “Generate an area chart to show data trends under continuous independent variables and observe the overall data trend, such as, displacement = velocity (average or instantaneous) × time: s = v × t. If the x-axis is time
Property description: “The data of the area chart, should be an array of objects, each object contains a `time` field and a `value` field, such as, [{ time: '2015', value: 23 }, { time: '2016', value: 32 }], when grouping is required for the area chart, the data must contain a `group` field, such as, [{ time: '2015', value: 23, group: 'A' }, { time: '2015', value: 32, group: 'B' }].”
Return a strictly updated detailed description in JSON format (just change the descriptions, don't change the structure of the inputted JSON). Start your answer with:
“New JSON format: …”
“””
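The verification step mentioned before the prompt could look roughly like the sketch below. It is an assumption about how such a check might work, not the code used for the article: it expects the response to start with the requested prefix, parses the rest as JSON, and rejects it if anything other than `description` strings changed. The names `parse_rewrite` and `blank_descriptions` are hypothetical.

```python
import json

PREFIX = "New JSON format:"


def blank_descriptions(node):
    """Copy a JSON-like structure with every string 'description' value blanked."""
    if isinstance(node, dict):
        return {
            key: "" if key == "description" and isinstance(value, str)
            else blank_descriptions(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [blank_descriptions(item) for item in node]
    return node


def parse_rewrite(response: str, original: dict) -> dict:
    """Parse the generator's answer; raise ValueError if it should be regenerated."""
    text = response.strip()
    if not text.startswith(PREFIX):
        raise ValueError("response does not start with the expected prefix")
    candidate = json.loads(text[len(PREFIX):])
    if blank_descriptions(candidate) != blank_descriptions(original):
        raise ValueError("structure changed, only descriptions may differ")
    return candidate


# Hypothetical tool definition and generated rewrite.
original = {
    "name": "get_weather",
    "description": "Get weather.",
    "parameters": {"properties": {"city": {"type": "string", "description": "City."}}},
}
rewritten = json.loads(json.dumps(original))
rewritten["description"] = "Get the current weather for a city, e.g. 'Paris'."
print(parse_rewrite(PREFIX + " " + json.dumps(rewritten), original)["description"])
```

A caller would catch `ValueError` (which also covers malformed JSON) and request a new generation.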
4. Tests
To test the hypothesis I ran several experiments:
- Measure the baseline performance of the model on the selected benchmark (Baseline);
- Replace the description of the correct tool with the generated one, including both the tool description itself and the parameter descriptions, as in all the replacement experiments (Correct tool replaced);
- Replace the wrong tools' descriptions with the generated ones (Wrong tool replaced);
- Replace all tool descriptions with the generated ones (All tools replaced).
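The four setups above can be sketched as one function that builds the tool list for a sample. This is an illustrative reading of the setups, not the original experiment code: the mode names are hypothetical, and the sketch assumes that “wrong tool replaced” swaps every incorrect tool in the sample.

```python
import copy


def build_variant(tools: list[dict], detailed: dict[str, dict],
                  correct_name: str, mode: str) -> list[dict]:
    """Return the tool list for one setup.

    mode is one of 'baseline', 'correct', 'wrong', 'all' (hypothetical names).
    detailed maps a tool name to its generated, more detailed definition.
    """
    if mode not in {"baseline", "correct", "wrong", "all"}:
        raise ValueError(f"unknown mode: {mode}")
    result = []
    for tool in tools:
        is_correct = tool["name"] == correct_name
        replace = (
            mode == "all"
            or (mode == "correct" and is_correct)
            or (mode == "wrong" and not is_correct)
        )
        source = detailed[tool["name"]] if replace else tool
        result.append(copy.deepcopy(source))
    return result


# Hypothetical sample with one correct tool ("a") and one distractor ("b").
tools = [{"name": "a", "description": "short a"}, {"name": "b", "description": "short b"}]
detailed = {
    "a": {"name": "a", "description": "long a"},
    "b": {"name": "b", "description": "long b"},
}
print([t["description"] for t in build_variant(tools, detailed, "a", "correct")])
```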
Here is a table with the results (each experiment was run 5 times, so the standard deviation (std) of the accuracy is also provided):
| Method | Mean accuracy | Accuracy std | Max accuracy over 5 runs |
| --- | --- | --- | --- |
| Baseline | 76.5% | 0.03 | 79.0% |
| Correct tool replaced | 80.5% | 0.03 | 85.2% |
| Wrong tool replaced | 75.1% | 0.01 | 76.5% |
| All tools replaced | 75.3% | 0.04 | 82.7% |
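For readers who want to reproduce this kind of summary, the three statistics in each row can be computed from per-run accuracies as sketched below. The run values in the example are made up for illustration and are not the measurements behind the table; the sketch also assumes the sample standard deviation, which may differ from the variant used for the article.

```python
import statistics


def summarize_runs(run_accuracies: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation and maximum over repeated runs."""
    return {
        "mean": statistics.mean(run_accuracies),
        "std": statistics.stdev(run_accuracies),
        "max": max(run_accuracies),
    }


# Illustrative values only, NOT the article's measurements.
example_runs = [0.70, 0.75, 0.80, 0.75, 0.75]
summary = summarize_runs(example_runs)
print(summary)
```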
Conclusion
The table above shows that the complexity of the tool descriptions introduces a bias in the model: the selected LLM tends to choose the tool with the more detailed description. At the same time, extended descriptions can confuse the model (the case where all tools are replaced).
The table shows that tool descriptions provide a way to manipulate and significantly shift the behavior and accuracy of the model, especially considering that the selected benchmark works with a small number of tools per model call: the average number of tools in each sample is 4.35.
It also clearly shows that LLMs can have a tool bias that MCP providers could exploit, and it may be the same kind of bias as the one I reported earlier: style bias. Research on this bias and its misuse can be important for further studies.
Engineering solution
I have prepared a PoC tool to address this problem in practice: Master-MCP. Master-MCP is an MCP proxy server that can connect to any number of MCPs and can itself be connected to the agent / LLM as a single MCP server (currently a stdio-transport MCP server). The features I implemented in Master-MCP by default:
- Ignoring parameters. The tool excludes from the tool parameter schema all parameters whose names start with the symbol “_”. Later such a parameter can be injected programmatically or use its default value (if provided).
- Tool description adjustment. Master-MCP collects all tools and their descriptions from the connected MCP servers and gives the user a way to adjust them. It exposes a simple UI for editing this list (JSON schema), so that the user can experiment with different tool descriptions.
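The two features above can be illustrated with the sketch below. It is not Master-MCP's actual code, only a minimal version of the idea under stated assumptions: tools are plain dicts with an `inputSchema`, and the function names and the `get_orders` tool are hypothetical.

```python
import copy


def hide_private_params(tool: dict) -> dict:
    """Copy of the tool whose schema omits parameters starting with '_'."""
    visible = copy.deepcopy(tool)
    schema = visible.get("inputSchema", {})
    props = schema.get("properties", {})
    schema["properties"] = {k: v for k, v in props.items() if not k.startswith("_")}
    if "required" in schema:
        schema["required"] = [k for k in schema["required"] if not k.startswith("_")]
    return visible


def fill_private_params(tool: dict, llm_args: dict, injected: dict) -> dict:
    """Add hidden parameters back: injected value first, then schema default."""
    args = dict(llm_args)
    for name, spec in tool["inputSchema"]["properties"].items():
        if not name.startswith("_"):
            continue
        if name in injected:
            args[name] = injected[name]
        elif "default" in spec:
            args[name] = spec["default"]
    return args


def override_description(tool: dict, overrides: dict[str, str]) -> dict:
    """Copy of the tool with a user-edited description, if one is configured."""
    edited = copy.deepcopy(tool)
    edited["description"] = overrides.get(tool["name"], tool.get("description", ""))
    return edited


# Hypothetical upstream tool with one LLM-facing and two hidden parameters.
tool = {
    "name": "get_orders",
    "description": "orders",
    "inputSchema": {
        "type": "object",
        "properties": {
            "status": {"type": "string"},
            "_user_id": {"type": "string"},
            "_limit": {"type": "integer", "default": 10},
        },
        "required": ["status", "_user_id"],
    },
}
exposed = override_description(hide_private_params(tool), {"get_orders": "List a user's orders."})
call_args = fill_private_params(tool, {"status": "open"}, {"_user_id": "u-42"})
print(exposed["inputSchema"]["properties"].keys(), call_args)
```

The model only ever sees `status`, while the proxy supplies `_user_id` and `_limit` before forwarding the call to the upstream server.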
I invite everyone who is interested to join the project. With the help of the community, the plans can include extending the functionality of Master-MCP, for example:
- Logging and monitoring, followed by advanced analysis;
- Tool sets and orchestration (including ML-enabled) that combine modern context management techniques with intelligent algorithms.
The project's current GitHub page: link



