I’ve been testing Anthropic’s Tool Search feature. The direction is promising, but I’m not ready to deploy it in production. Here’s what marketing teams need to know before implementing it.
What Tool Search Actually Does
Tool Search tackles a real problem: context window bloat. Standard tool calling loads all tool definitions into Claude’s context window upfront. With 50 tools, that’s roughly 10,000 to 20,000 tokens consumed before you’ve started any work.
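As a back-of-envelope check on that range: 10,000 to 20,000 tokens across 50 tools implies roughly 200 to 400 tokens per definition. The per-tool figures in this sketch are inferred from that range, not measured, so treat them as assumptions.

```python
# Back-of-envelope estimate of the upfront context cost of tool definitions.
# The per-tool range is inferred from the 10,000-20,000 token figure above;
# it is an assumption, not a measurement.
TOKENS_PER_TOOL_LOW = 200
TOKENS_PER_TOOL_HIGH = 400


def upfront_token_cost(tool_count):
    """Return a (low, high) token estimate for loading every tool upfront."""
    return tool_count * TOKENS_PER_TOOL_LOW, tool_count * TOKENS_PER_TOOL_HIGH


low, high = upfront_token_cost(50)
print(f"50 tools: {low:,} to {high:,} tokens")
```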
Tool Search flips this. Instead of loading everything upfront, Claude searches your tool catalogue dynamically and loads only what it needs. You mark tools with defer_loading: true in your API request, and Claude discovers them on demand through either regex pattern matching or BM25 natural language search.
When you enable Tool Search, Claude initially sees only the search tool itself and any non-deferred tools. When it needs something else, it searches using patterns like "weather" (regex variant) or natural language queries like “tools for sending emails” (BM25 variant). The API returns three to five relevant tool references, which are automatically expanded into full definitions.
The Two Search Variants
Anthropic offers two approaches. The regex variant (tool_search_tool_regex_20251119) uses Python’s re.search() syntax, so a pattern matches anywhere in the searched text rather than requiring an exact match. Common patterns include "weather" to match anything containing that string, "get_.*_data" for flexible matching, or "(?i)slack" for case-insensitive searches. Maximum query length is 200 characters.
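To see how those three patterns behave, here is a minimal local sketch using Python's re.search(). The tool names are hypothetical, and it only matches against names, whereas the real search also covers descriptions and arguments.

```python
import re

# Hypothetical tool names, used only to illustrate matching behaviour.
TOOL_NAMES = [
    "get_weather",
    "get_sales_data",
    "get_user_data",
    "Slack_SendMessage",
    "send_email",
]


def matching_tools(pattern, names):
    """Return the names in which re.search finds the pattern anywhere."""
    return [name for name in names if re.search(pattern, name)]


print(matching_tools("weather", TOOL_NAMES))      # ['get_weather']
print(matching_tools("get_.*_data", TOOL_NAMES))  # ['get_sales_data', 'get_user_data']
print(matching_tools("(?i)slack", TOOL_NAMES))    # ['Slack_SendMessage']
print(matching_tools("slack", TOOL_NAMES))        # [] (case-sensitive by default)
```

The last line is the one that catches people out: without (?i), a lowercase pattern will not find a capitalised tool name.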
The BM25 variant (tool_search_tool_bm25_20251119) accepts natural language queries instead. It’s simpler for marketing teams without regex knowledge. Both variants search across tool names, descriptions, argument names, and argument descriptions.
| Feature | Regex Variant | BM25 Variant |
|---|---|---|
| Query format | Python regex patterns | Natural language |
| Ease of use | Requires regex knowledge | More intuitive |
| Precision | High with good patterns | Depends on keyword overlap with tool descriptions |
| Max query length | 200 characters | 200 characters |
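For intuition about the BM25 column, here is a textbook BM25 ranker over a tiny catalogue. This is a generic sketch, not Anthropic's implementation: the tokeniser, the k1 and b values, and the tool descriptions are all my assumptions, and the hosted version may tokenise or normalise words differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric words (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (name -> description) against query with textbook BM25."""
    if not docs:
        return []
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical descriptions for three tools named later in this post.
CATALOGUE = {
    "Gmail_SendEmail": "Send an email message via Gmail",
    "Slack_SendMessage": "Post a message to a Slack channel",
    "Zendesk_CreateTicket": "Create a support ticket in Zendesk",
}

print(bm25_rank("send an email", CATALOGUE)[0][0])
print(bm25_rank("sending emails", CATALOGUE))
```

In this sketch "send an email" ranks Gmail_SendEmail first, but "sending emails" scores zero against every tool because plain BM25 only matches exact words. That is worth keeping in mind when you write descriptions.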
Real-World Implementation Example
One of our clients ran a marketing automation workflow connecting Gmail, Slack, HubSpot, and two analytics platforms. Their initial implementation loaded 50+ tool definitions upfront, consuming nearly 15,000 tokens before any work began. That meant slower responses and less context space for campaign analysis.
We tested Tool Search as an alternative, keeping their three most-used tools (fetch campaign data, create tasks, send notifications) as non-deferred and deferring everything else. Token savings looked good, but retrieval accuracy stopped us from going to production.
The official documentation covers the full technical spec. Here’s a simplified structure:
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "tools": [
    {
      "type": "tool_search_tool_bm25_20251119",
      "name": "tool_search_tool_bm25"
    },
    {
      "name": "send_email",
      "description": "Send email via Gmail",
      "defer_loading": true,
      "input_schema": {...}
    }
  ]
}
```
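If you hold your tool definitions in code, the split between deferred and non-deferred tools can be applied in one place. This is a minimal sketch of how we might do it: build_tools, the tool names, and the empty placeholder schemas are hypothetical, and it only constructs the tools list rather than calling the API.

```python
def build_tools(tool_definitions, always_loaded):
    """Prepend the BM25 search tool and defer every tool not in always_loaded."""
    tools = [
        {"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"}
    ]
    for definition in tool_definitions:
        tool = dict(definition)  # shallow copy so the input is not mutated
        if tool["name"] not in always_loaded:
            tool["defer_loading"] = True
        tools.append(tool)
    return tools


# Placeholder schema for illustration only; real tools need real schemas.
EMPTY_SCHEMA = {"type": "object", "properties": {}}

# Hypothetical tool names.
definitions = [
    {"name": "fetch_campaign_data", "description": "Fetch campaign data", "input_schema": EMPTY_SCHEMA},
    {"name": "send_email", "description": "Send email via Gmail", "input_schema": EMPTY_SCHEMA},
]

tools = build_tools(definitions, always_loaded={"fetch_campaign_data"})
for tool in tools:
    print(tool["name"], tool.get("defer_loading", False))
```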
MCP Integration and the Bigger Picture
Tool Search integrates with Anthropic’s Model Context Protocol (MCP). MCP standardises connections between AI agents and external tools — we think it will become the connective tissue of the marketing stack.
With the mcp-client-2025-11-20 beta header, you can defer loading MCP tools using default_config. This matters most when connecting multiple MCP servers. The vision is AI agents navigating entire martech stacks through natural language, but current retrieval accuracy holds that back.
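As a rough sketch of what a deferred MCP toolset entry might look like: default_config comes from the feature described above, but the other field names here (mcp_toolset, mcp_server_name, configs) are my recollection of the documentation and should be checked against it before use. The server and tool names are hypothetical.

```python
def deferred_mcp_toolset(server_name, always_loaded):
    """Build a toolset entry that defers all tools except those listed.

    Field names other than default_config are assumptions; verify them
    against the official MCP connector documentation.
    """
    return {
        "type": "mcp_toolset",
        "mcp_server_name": server_name,
        # Defer every tool from this server by default...
        "default_config": {"defer_loading": True},
        # ...and override the few that should always be in context.
        "configs": {name: {"defer_loading": False} for name in always_loaded},
    }


toolset = deferred_mcp_toolset("analytics-server", ["fetch_campaign_data"])
print(toolset)
```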
The Arcade.dev Reality Check
The team at Arcade.dev ran a thorough test, loading 4,027 tools and running 25 straightforward workflows. These weren’t edge cases — they were everyday agentic tasks like “send an email to my colleague” or “post a message to Slack”.
Regex search hit 56% retrieval accuracy (14 out of 25 tasks). BM25 did marginally better at 64% (16 out of 25). Worse, common tools failed basic retrieval: “send email” prompts couldn’t find Gmail_SendEmail, “post a message to Slack” missed Slack_SendMessage, and ticket creation requests failed to surface Zendesk_CreateTicket.
“When ‘send an email’ can’t find Gmail_SendEmail, there’s still work to do.”
Eric Gustin, Arcade.dev
This isn’t about selection or parameterisation accuracy — it’s purely retrieval. Did the correct tool even appear in search results?
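In code, that retrieval metric is simply the share of tasks where the expected tool shows up in the returned list. The sketch below uses synthetic records that reproduce the 14-out-of-25 arithmetic only; it is not Arcade.dev's data or harness, and the tool names in it are placeholders.

```python
def retrieval_accuracy(results):
    """results: list of (expected_tool, returned_tools) pairs, one per task."""
    if not results:
        raise ValueError("no results to score")
    hits = sum(1 for expected, returned in results if expected in returned)
    return hits / len(results)


# Synthetic records: 14 hits and 11 misses, mirroring the regex figure above.
synthetic = (
    [("expected_tool", ["expected_tool", "other_tool"])] * 14
    + [("expected_tool", ["other_tool"])] * 11
)

print(f"{retrieval_accuracy(synthetic):.0%}")  # 56%
```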
Current Limits and Constraints
Anthropic supports up to 10,000 tools in your catalogue, returning three to five relevant tools per search. The feature works with Claude Sonnet 4.5, Sonnet 4.6, Opus 4.5, and Opus 4.6 — no Haiku support. It’s still in public beta and requires the advanced-tool-use-2025-11-20 header.
Tool Search doesn’t work with tool use examples, so teams relying on few-shot prompting will need a workaround. Regex patterns are capped at 200 characters, which means you’ll need to design patterns carefully. Common error codes: invalid_pattern for malformed regex, pattern_too_long for exceeding limits, and too_many_requests for rate limits.
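If you generate regex queries yourself, for example when testing, a local pre-check can catch the first two error cases early. This is a sketch under assumptions: it checks against Python's own regex engine and the 200-character cap mentioned above, and the server's validation may differ in detail.

```python
import re

MAX_PATTERN_LENGTH = 200  # cap on regex query length described above


def check_pattern(pattern):
    """Return the likely error code for a pattern, or None if it looks valid."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        return "pattern_too_long"
    try:
        re.compile(pattern)
    except re.error:
        return "invalid_pattern"
    return None


print(check_pattern("get_.*_data"))  # None
print(check_pattern("(unclosed"))    # invalid_pattern
print(check_pattern("a" * 201))      # pattern_too_long
```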
How to Implement Tool Search
Based on our testing, here’s what worked:
- Audit your current tool catalogue and usage patterns
- Identify the three to five most frequently accessed tools
- Keep those tools non-deferred for immediate availability
- Rewrite remaining tool descriptions with semantic keywords
- Test retrieval accuracy with realistic marketing workflows
- Monitor tool discovery logs to identify misses
- Iterate on descriptions based on discovery patterns
When writing tool descriptions, think about how marketers actually describe tasks. Skip the technical jargon — use phrases like “send campaign emails” or “fetch conversion data from analytics”. The BM25 variant rewards clear, natural-language descriptions.
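The audit and "keep the top few loaded" steps can be sketched from a simple usage log. Everything here is hypothetical: the log is a plain list of tool names, one per call, and split_by_usage is an illustrative helper rather than part of any SDK.

```python
from collections import Counter


def split_by_usage(call_log, keep_loaded=3):
    """Return (always_loaded, deferred) tool names, ranked by call count."""
    ranked = [name for name, _ in Counter(call_log).most_common()]
    return ranked[:keep_loaded], ranked[keep_loaded:]


# Hypothetical usage log: one tool name per recorded call.
call_log = (
    ["fetch_campaign_data"] * 40
    + ["create_task"] * 25
    + ["send_notification"] * 20
    + ["send_email"] * 5
    + ["export_report"] * 2
)

always_loaded, deferred = split_by_usage(call_log)
print("keep loaded:", always_loaded)
print("defer:", deferred)
```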
Where This Technology Needs to Go
The architecture makes sense: defer tool loading to avoid context bloat, discover tools just-in-time, keep interactions lightweight. The efficiency gains are real and add up fast at scale. But retrieval accuracy of around 60% (56% for regex and 64% for BM25 in the Arcade.dev test) isn’t production-ready when agents need to reliably take real-world actions.
“The future of user interaction will not be in the web browser. Traditional software applications will become predominantly headless, backend platforms that provide data and functions to AI agents via standards such as MCP.”
Jensen Huang, President and CEO of NVIDIA
For marketing teams, the promise is still compelling. Imagine “show me last week’s organic traffic to product pages” automatically finding the right PostHog or GA4 tool, fetching data, and formatting results. We’re building towards that at Growth Method, but we’re not there yet.
What I’d Do Now
Tool Search points in the right direction. The token savings are meaningful, and natural language tool discovery would be a step change for marketing teams managing bloated tech stacks.
But when roughly four in ten tool searches fail before you even reach selection and parameterisation, you can’t put it in production. Marketing workflows need “send the campaign report” to find the right email tool every time, not six times out of ten.
For now, stick with traditional tool calling in production. Keep an eye on retrieval accuracy improvements, and get your tool catalogue ready — clear descriptions, semantic keywords, sensible naming. When retrieval gets reliable, the teams with well-structured catalogues will move fastest.