Hermes Agent Ships Tool Search for MCP: Anthropic Evals Show 49% to 74% Accuracy Gain in Opus 4

nimda May 30, 2026

Nous Research's open-source Hermes Agent now ships with a Tool Search feature. It directly addresses a growing bottleneck in AI agent systems: too many MCP tools filling the context window. In this explainer, we cover what Tool Search does, how it works, and when to use it.

The Problem: MCP Tools Are Eating Your Context Window

When you connect multiple MCP (Model Context Protocol) servers to an AI agent, the JSON schema of every tool is sent to the model on every turn. This happens even if the model only needs one or two tools for a particular task.

Real-world deployments feel this immediately. A Hermes deployment with five MCP servers and 34 tools shows an average prompt size of 45,000 tokens per turn. About 22,000 of those tokens, roughly 50%, are tool-schema overhead alone.
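As a quick check on those figures, the share of each turn spent on schemas can be computed directly. This is only a sketch of the arithmetic; the variable names are illustrative, and the per-tool average assumes schemas are evenly sized, which the source does not state.

```python
# Figures quoted above for the five-server, 34-tool deployment.
prompt_tokens_per_turn = 45_000
schema_tokens_per_turn = 22_000
tool_count = 34

# Fraction of every turn that is tool-schema overhead.
overhead_share = schema_tokens_per_turn / prompt_tokens_per_turn

# Assumption: schemas are roughly the same size across tools.
avg_schema_tokens = schema_tokens_per_turn / tool_count

print(f"schema overhead: {overhead_share:.1%} of each turn")
print(f"average schema size: {avg_schema_tokens:.0f} tokens per tool")
```

The share works out to just under half of each turn, which matches the "roughly 50%" figure.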

Anthropic's own engineering data shows tool definitions can consume 134,000 tokens before optimization. The Tool Attention paper measures the “MCP Tool Tax” at 15,000–60,000 tokens per turn in typical multi-server deployments.

This creates two distinct problems:

Cost: Cache-miss generations at the start of a session can cost $0.07–$0.10 per turn.
Loss of accuracy: Decision paralysis sets in when the model sees hundreds of irrelevant tool options at once.

Source: hermes-agent.nousresearch.com/docs · Nous Research 2026

Tool Search is Hermes Agent's progressive disclosure layer for MCP and non-core plugin tools. Instead of loading every tool schema upfront, the model loads only what it needs, on demand, per turn.

When Tool Search is active, MCP and plugin tools are replaced in the list of visible tools by three bridge tools:

tool_search(query, limit?)   — search the deferred-tool catalog
tool_describe(name)          — load the full schema for one tool
tool_call(name, arguments)   — invoke a deferred tool

A typical exchange looks like this:

Model: tool_search("create a github issue")
  → { matches: [{ name: "mcp_github_create_issue", ... }] }
Model: tool_describe("mcp_github_create_issue")
  → { parameters: { type: "object", properties: { ... } } }
Model: tool_call("mcp_github_create_issue", { title: "...", body: "..." })
  → { ok: true, issue_number: 42 }

The model searches for what it needs, loads the schema, and calls the tool. All hooks, guardrails, and approval prompts run against the name of the underlying tool, not against the bridge.
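The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Hermes's implementation: the `DEFERRED` registry, the `approval_hook` function, and the naive keyword matching in `tool_search` are all hypothetical stand-ins. Only the three bridge names and the example tool name come from the article. The point it shows is that the guardrail sees the real tool name rather than `tool_call`.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry of deferred tools: name -> (JSON schema, handler).
DEFERRED: Dict[str, Tuple[dict, Callable[[dict], Any]]] = {
    "mcp_github_create_issue": (
        {"type": "object",
         "properties": {"title": {"type": "string"}, "body": {"type": "string"}}},
        lambda args: {"ok": True, "issue_number": 42},  # stubbed result
    ),
}

audit_log: List[str] = []  # names seen by the guardrail


def approval_hook(tool_name: str, arguments: dict) -> bool:
    """Hypothetical guardrail: records the name it was asked about, then approves."""
    audit_log.append(tool_name)
    return True


def tool_search(query: str, limit: int = 5) -> dict:
    """Naive keyword match standing in for the real retrieval step."""
    words = query.lower().split()
    matches = [{"name": n} for n in DEFERRED if any(w in n for w in words)]
    return {"matches": matches[:limit]}


def tool_describe(name: str) -> dict:
    """Return the full schema for one deferred tool."""
    return {"parameters": DEFERRED[name][0]}


def tool_call(name: str, arguments: dict) -> Any:
    """Unwrap the bridge: checks run against the underlying tool name."""
    if name not in DEFERRED:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if not approval_hook(name, arguments):
        return {"ok": False, "error": "denied"}
    return DEFERRED[name][1](arguments)


found = tool_search("create a github issue")["matches"][0]["name"]
schema = tool_describe(found)
result = tool_call(found, {"title": "...", "body": "..."})
print(found, result, audit_log)
```

After the call, `audit_log` holds `mcp_github_create_issue`, not the bridge name, which is the property the paragraph above describes.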

Accuracy Numbers

This is not just a token-saving feature. Tool Search also improves model accuracy on MCP evaluations.

According to Anthropic's internal MCP evaluations:

Claude Opus 4: accuracy improved from 49% → 74% with Tool Search enabled
Claude Opus 4.5: accuracy improved from 79.5% → 88.1% with Tool Search enabled

Large tool catalogs create “decision paralysis”: the model struggles to choose among many irrelevant options. Removing those options from the context window reduces false positives. Anthropic's data also shows an 85% reduction in tool-definition token usage while maintaining access to the full tool library.
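To put the two results side by side, it helps to separate the absolute gain (percentage points) from the relative gain. The short sketch below does that arithmetic on the quoted figures; the `gains` helper is illustrative only.

```python
# Accuracy figures quoted above, as (without, with) Tool Search, in percent.
results = {
    "Claude Opus 4": (49.0, 74.0),
    "Claude Opus 4.5": (79.5, 88.1),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = after - before
    return points, points / before * 100


for model, (before, after) in results.items():
    points, relative = gains(before, after)
    print(f"{model}: +{points:.1f} points ({relative:.0f}% relative)")
```

The absolute gains are 25 points for Opus 4 and 8.6 points for Opus 4.5, so the weaker baseline benefits more.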

How Retrieval Works: BM25 + Fallback

Under the hood, Hermes uses BM25, a classic information-retrieval ranking algorithm, to match the model's query against a catalog of tool names, descriptions, and parameter names.

If BM25 returns no hits with a positive score, the system falls back to plain substring matching on tool names. This guards against degenerate zero-IDF cases, such as searching for "github" in a catalog where every tool name contains “github.”
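The sketch below shows one way such a search could work. It is an assumption-laden illustration rather than Hermes's code: the `search_tools` function, the tokenizer, the `k1`/`b` values, and the choice to clamp IDF at zero are all mine, and the three-tool catalog is made up apart from `mcp_github_create_issue`. It does reproduce the failure mode described above: a term present in every document scores zero, so the substring fallback takes over.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics (so snake_case names split too)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_tools(query: str, catalog: Dict[str, str], limit: int = 5,
                 k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Rank tool names by BM25 over name + text; fall back to substring match."""
    docs = {name: tokenize(f"{name} {text}") for name, text in catalog.items()}
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            # Classic BM25 IDF, clamped at zero (an assumption of this sketch).
            idf = max(0.0, math.log((n_docs - doc_freq[term] + 0.5)
                                    / (doc_freq[term] + 0.5)))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores[name] = score

    ranked = sorted((n for n in scores if scores[n] > 0),
                    key=lambda n: scores[n], reverse=True)
    if not ranked:
        # No positive BM25 score: fall back to substring match on tool names.
        needle = query.strip().lower()
        ranked = [n for n in catalog if needle and needle in n.lower()]
    return ranked[:limit]


# Hypothetical catalog: every tool name and description mentions "github".
CATALOG = {
    "mcp_github_create_issue": "Create a new issue in a GitHub repository title body",
    "mcp_github_list_pulls": "List pull requests in a GitHub repository state",
    "mcp_github_merge_pull": "Merge a pull request in a GitHub repository number",
}

print(search_tools("create issue", CATALOG))
print(search_tools("github", CATALOG))
```

The first query ranks `mcp_github_create_issue` alone, because "create" and "issue" are rare in the catalog. The second query gets a zero score everywhere and is answered by the fallback, which returns all three tools.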

The catalog is stateless. It is rebuilt from the current tool list every time the tool array is assembled. This prevents drift bugs in which a stored catalog falls out of sync with live tool registrations.

By default, Tool Search runs in auto mode. It activates only if the deferrable tool schemas would take up at least 10% of the active model's context window.

Below that threshold, tool-array assembly is a clean pass-through, and you pay no extra overhead.

This decision is re-evaluated on every assembly:

A session with few MCP tools and a long-context model may never activate Tool Search.
A session with multiple MCP servers attached (typically 15+ tools) activates it.
Removing servers mid-session correctly reverts to direct tool exposure on the next assembly.
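The activation rule can be sketched as a single predicate that is re-run on every assembly. The function name, its parameters, and the context-window sizes in the usage lines are hypothetical; how Hermes counts schema tokens is not described in the source, so the token figure is simply passed in.

```python
def tool_search_active(mode: str, schema_tokens: int, context_window: int,
                       deferrable_tools: int, threshold_pct: int = 10) -> bool:
    """Decide whether to swap deferrable tools for the bridge on this assembly."""
    if mode not in ("auto", "on", "off"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "off" or deferrable_tools == 0:
        return False  # nothing to defer, or the feature is disabled
    if mode == "on":
        return True   # always active when at least one tool is deferrable
    # auto: active once schemas reach threshold_pct of the context window.
    return schema_tokens * 100 >= threshold_pct * context_window


# 22,000 schema tokens against two hypothetical context-window sizes.
print(tool_search_active("auto", 22_000, 200_000, 34))    # 11% of the window
print(tool_search_active("auto", 22_000, 1_000_000, 34))  # 2.2% of the window
```

With the same tools attached, the smaller window crosses the 10% threshold and the larger one does not, which is why a long-context session may never activate the feature.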

Configuration reference

Add this to your hermes.yaml to control the behavior:

tools:
  tool_search:
    enabled: auto        # auto (default), on, or off
    threshold_pct: 10    # % of context at which auto mode kicks in
    search_default_limit: 5
    max_search_limit: 20

Key	Default	Description
`enabled`	`auto`	`auto` activates above the threshold; `on` is always active if there is at least one deferrable tool; `off` disables it completely
`threshold_pct`	`10`	Percentage of the context length at which `auto` kicks in. Range: 0–100
`search_default_limit`	`5`	Hits returned when the model calls `tool_search` without a `limit`
`max_search_limit`	`20`	Hard upper bound on the `limit` the model can request. Range: 1–50

You can also use a simple boolean shorthand:

tools:
  tool_search: true   # equivalent to {enabled: auto}
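To make the defaults, ranges, and shorthand concrete, here is a hedged sketch of how a loader might normalize the already-parsed `tool_search` value. The `normalize_tool_search` and `effective_limit` helpers are hypothetical, and treating `false` as `off` is my assumption; the source only documents `true`.

```python
DEFAULTS = {
    "enabled": "auto",
    "threshold_pct": 10,
    "search_default_limit": 5,
    "max_search_limit": 20,
}


def normalize_tool_search(value) -> dict:
    """Expand the tools.tool_search setting into a full, validated dict."""
    if value is True:
        cfg = dict(DEFAULTS)                    # documented shorthand
    elif value is False:
        cfg = {**DEFAULTS, "enabled": "off"}    # assumption, not documented
    elif isinstance(value, dict):
        cfg = {**DEFAULTS, **value}
    else:
        raise TypeError("tool_search must be a bool or a mapping")
    if cfg["enabled"] not in ("auto", "on", "off"):
        raise ValueError("enabled must be auto, on, or off")
    if not 0 <= cfg["threshold_pct"] <= 100:
        raise ValueError("threshold_pct must be in 0-100")
    if not 1 <= cfg["max_search_limit"] <= 50:
        raise ValueError("max_search_limit must be in 1-50")
    return cfg


def effective_limit(cfg: dict, requested=None) -> int:
    """Apply the default when no limit is given, then cap at max_search_limit."""
    limit = cfg["search_default_limit"] if requested is None else requested
    return max(1, min(limit, cfg["max_search_limit"]))


cfg = normalize_tool_search(True)
print(cfg["enabled"], effective_limit(cfg), effective_limit(cfg, 100))
```

A `tool_search` call without a `limit` gets 5 hits, and a request for 100 is capped at 20 under the default settings.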

Marktechpost Visual Explainer

Nous Research – Hermes Agent
01 / 07

Tool Search: Fixing the MCP Context Window

When multiple MCP servers connect to an agent, every tool's JSON schema is loaded into the model context on every turn, even when only one tool is needed. Hermes Agent's Tool Search fixes this with progressive schema disclosure.

~22K
tokens/turn of overhead
on a 5-server, 34-tool setup

85%
reduction in tool-definition
token usage (Anthropic data)

134K
tokens consumed by tool definitions
before optimization (Anthropic)

The problem
02 / 07

The MCP Tool Tax

Every connected MCP server dumps its full JSON schema into the context upfront. With many servers, this drowns out the real conversation and forces the model to choose from hundreds of irrelevant tools, causing decision paralysis.

The research paper arXiv 2604.21816 (“Tool Attention”) measures the MCP Tool Tax at 15,000–60,000 tokens per turn. Cache misses can cost $0.07–$0.10 per turn on API deployments.

GitHub: 35 tools, ~26K tokens
Slack: 11 tools, ~21K tokens
Jira: ~17K tokens alone

A five-server setup can approach 100K+ tokens before the conversation starts.

What's going on
03 / 07

Tool Search: A Progressive Disclosure Layer

Tool Search is a Hermes Agent feature that replaces all MCP tool schemas in the visible tool list with just three lightweight bridge tools. The model loads the schema for each tool on demand, only when it actually needs it.

tool_search(query, limit?)
tool_describe(name)
tool_call(name, arguments)

All hooks, guardrails, and approval prompts still apply, against the name of the underlying tool rather than the bridge. The CLI activity feed also unwraps to show the actual tool, not the bridge.

How It Works
04 / 07

A Three-Step Retrieval Sequence

tool_search
BM25 query against tool name, description, and parameters

tool_describe
Loads the full JSON schema of the matched tool into the context

tool_call
Unwraps the bridge; the original tool runs with full guardrails

Model: tool_search("create a github issue")
  → { matches: [{ name: "mcp_github_create_issue" }] }
Model: tool_describe("mcp_github_create_issue")
  → { parameters: { type: "object", properties: { ... } } }
Model: tool_call("mcp_github_create_issue", { title: "..." })
  → { ok: true, issue_number: 42 }

Accuracy Results
05 / 07

Anthropic MCP Evals Show Significant Accuracy Gains

Large tool catalogs cause decision paralysis. Removing irrelevant schemas from the context reduces false positives. Anthropic's internal MCP evaluations show significant accuracy improvements with Tool Search enabled.

49% → 74%
Claude Opus 4
accuracy on MCP evals

79.5% → 88.1%
Claude Opus 4.5
accuracy on MCP evals

Note: On Opus 4, about 26 percentage points of accuracy are still lost to retrieval failures. Smaller models are less reliable at query formulation. Tool Search assumes that the model can write a meaningful search query.

Configuration
06 / 07

Configuring Tool Search in hermes.yaml

tools:
  tool_search:
    enabled: auto        # auto (default), on, or off
    threshold_pct: 10    # % of context, auto mode only
    search_default_limit: 5
    max_search_limit: 20

# Shorthand:
tools:
  tool_search: true   # equivalent to {enabled: auto}

Key	Default	Description
enabled	auto	auto activates above the threshold; on is always active; off disables it
threshold_pct	10	% of context length at which auto mode kicks in. Range: 0–100
search_default_limit	5	Hits returned when the model calls tool_search without a limit
max_search_limit	20	Hard upper bound on the limit the model can request. Range: 1–50

Key Takeaways
07 / 07

When You Should Use It, and When You Shouldn't

✓ 15+ attached tools
✓ Few tools used per turn
✓ Multiple MCP servers
⚠ Small toolsets: the bridge is net overhead
⚠ All tools used on every turn

Bridge tools cost ~300 tokens + at least one extra round trip per cold tool
Deferred schemas do not benefit from system-prompt prefix caching
Catalog is stateless: it rebuilds on every assembly, preventing drift bugs
Security-scoped: the bridge cannot access tools outside the toolsets assigned to the session
Core Hermes tools (terminal, read_file, web_search, send_message…) are never deferred
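The trade-off in the list above can be estimated with a rough token budget. The sketch below counts tokens only and ignores the extra round trip and any prompt-caching effects. It reuses the ~300-token bridge cost and the 22,000-token, 34-tool example from the article, and assumes every schema is the same size, which is a simplification of mine.

```python
BRIDGE_TOKENS = 300                  # approximate bridge cost quoted above
FULL_SCHEMA_TOKENS = 22_000          # five-server, 34-tool example
TOOL_COUNT = 34
# Assumption: schemas are evenly sized across tools.
AVG_TOOL_SCHEMA_TOKENS = FULL_SCHEMA_TOKENS / TOOL_COUNT


def deferred_tokens(tools_described: int) -> float:
    """Schema tokens in context when only some tools have been described."""
    return BRIDGE_TOKENS + tools_described * AVG_TOOL_SCHEMA_TOKENS


def break_even_tools() -> int:
    """Smallest number of described tools at which deferral stops saving tokens."""
    n = 0
    while deferred_tokens(n) < FULL_SCHEMA_TOKENS:
        n += 1
    return n


print(f"2 tools described: ~{deferred_tokens(2):.0f} tokens vs {FULL_SCHEMA_TOKENS}")
print(f"break-even at {break_even_tools()} described tools")
```

Under these assumptions, deferral only stops paying off when nearly every tool is in use. That is consistent with the warning about sessions where all tools are used on every turn.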

Source: hermes-agent.nousresearch.com/docs – Anthropic Engineering Blog – Nous Research 2026

Key Takeaways

Tool Search defers MCP tool schemas until the model actually needs them, using a tool_search / tool_describe / tool_call bridge.
Anthropic's testing shows an accuracy gain from 49% → 74% on Claude Opus 4 with large tool catalogs.
BM25 retrieval over tool name + description + parameter names powers the search, with a substring fallback for zero-IDF edge cases.
The auto (default) mode is self-configuring: it activates only if tool schemas exceed 10% of the context window.
Core Hermes tools are never deferred; only MCP and non-core plugin tools are eligible.

Check out the Hermes Agent Tool Search docs and Anthropic's Advanced Tool Use post.
