Anthropic's Tool Search: Not Ready for Production Marketing Workflows

I’ve been testing Anthropic’s Tool Search feature. The direction is promising, but I’m not ready to deploy it in production. Here’s what marketing teams need to know before implementing it.

Table of contents

What Tool Search Actually Does
The Two Search Variants
Real-World Implementation Example
MCP Integration and the Bigger Picture
The Arcade.dev Reality Check
Current Limits and Constraints
How to Implement Tool Search
Where This Technology Needs to Go
What I’d Do Now

What Tool Search Actually Does

Tool Search tackles a real problem: context window bloat. Standard tool calling loads all tool definitions into Claude’s context window upfront. With 50 tools, that’s roughly 10,000 to 20,000 tokens consumed before you’ve started any work.
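To make that figure concrete, here is a back-of-envelope sketch. The per-tool range of 200 to 400 tokens is an assumption implied by the numbers above (10,000 to 20,000 tokens across 50 tools), not something I measured, and real definitions will vary with schema size.

```python
# Back-of-envelope estimate of the upfront cost of tool definitions.
# ASSUMPTION: 200 to 400 tokens per tool, implied by the article's
# "10,000 to 20,000 tokens for 50 tools" figure. Not a measured value.
TOKENS_PER_TOOL_LOW = 200
TOKENS_PER_TOOL_HIGH = 400


def upfront_token_range(tool_count: int) -> tuple[int, int]:
    """Return (low, high) token estimates for loading every definition upfront."""
    return tool_count * TOKENS_PER_TOOL_LOW, tool_count * TOKENS_PER_TOOL_HIGH


low, high = upfront_token_range(50)
print(f"50 tools: {low:,} to {high:,} tokens before any work starts")
```

Swap in your own tool count and a per-tool average taken from your real definitions to see whether deferral is worth the effort.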

Tool Search flips this. Claude searches your tool catalogue dynamically and loads only what it needs. You mark tools with defer_loading: true in your API request, and Claude discovers them on demand through either regex pattern matching or BM25 natural language search.

When you enable Tool Search, Claude initially sees only the search tool itself and any non-deferred tools. When it needs something else, it searches using patterns like "weather" (regex variant) or natural language queries like “tools for sending emails” (BM25 variant). The API returns three to five relevant tool references, which are automatically expanded into full definitions.

The Two Search Variants

Anthropic offers two approaches. The regex variant (tool_search_tool_regex_20251119) uses Python’s re.search() syntax. Common patterns include "weather" for a literal match anywhere in the searched text, "get_.*_data" for flexible matching, or "(?i)slack" for case-insensitive searches. Maximum query length is 200 characters.
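Because re.search() matches anywhere in a string, these patterns are easy to try locally before you rely on them. The sketch below runs the three patterns against a handful of hypothetical tool names. It only searches names, whereas the real search also covers descriptions and arguments, so treat it as a way to sanity-check pattern syntax rather than a model of the API.

```python
import re

# Hypothetical tool names, for illustration only.
TOOL_NAMES = [
    "get_weather",
    "get_campaign_data",
    "get_user_data",
    "Slack_SendMessage",
    "send_email",
]


def matching_tools(pattern: str, names: list[str]) -> list[str]:
    """Return the names in which re.search finds the pattern anywhere."""
    return [name for name in names if re.search(pattern, name)]


print(matching_tools("weather", TOOL_NAMES))      # literal substring
print(matching_tools("get_.*_data", TOOL_NAMES))  # wildcard between fixed parts
print(matching_tools("(?i)slack", TOOL_NAMES))    # case-insensitive flag
print(matching_tools("slack", TOOL_NAMES))        # case-sensitive by default
```

The last call returns nothing because the only Slack tool is capitalised, which is the kind of miss the (?i) flag avoids.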

The BM25 variant (tool_search_tool_bm25_20251119) accepts natural language queries instead. It’s simpler for marketing teams without regex knowledge. Both variants search across tool names, descriptions, argument names, and argument descriptions.

Feature	Regex Variant	BM25 Variant
Query format	Python regex patterns	Natural language
Ease of use	Requires regex knowledge	More intuitive
Precision	High with good patterns	Keyword relevance ranking (lexical, not semantic)
Max query length	200 characters	200 characters
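BM25 is a term-frequency ranking function, so it helps to see what kind of wording it rewards. Below is a minimal textbook BM25 sketch over a hypothetical three-tool catalogue. It is not Anthropic’s implementation: the tokeniser, the k1 and b values, and the lack of stemming are all my assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical catalogue: tool name -> description.
CATALOGUE = {
    "Gmail_SendEmail": "Send an email message through Gmail",
    "Slack_SendMessage": "Post a message to a Slack channel",
    "HubSpot_GetCampaign": "Fetch campaign performance data from HubSpot",
}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics; no stemming (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
    """Rank tools by a textbook BM25 score over name plus description."""
    tokenized = {name: tokenize(f"{name} {desc}") for name, desc in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # only exact token overlap contributes
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(bm25_rank("send email", CATALOGUE))
print(bm25_rank("tools for sending emails", CATALOGUE))
```

In this toy version, “send email” ranks Gmail_SendEmail first, while “tools for sending emails” scores zero for every tool because none of its tokens appear verbatim in the catalogue. A production ranker may well normalise word forms, but the sketch shows why descriptions that reuse the words people actually type tend to retrieve better.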

Real-World Implementation Example

One of our clients ran a marketing automation workflow connecting Gmail, Slack, HubSpot, and two analytics platforms. Their initial implementation loaded 50+ tool definitions upfront, consuming nearly 15,000 tokens before any work began. That meant slower responses and less context space for campaign analysis.

We tested Tool Search as an alternative, keeping their three most-used tools (fetch campaign data, create tasks, send notifications) as non-deferred and deferring everything else. Token savings looked good, but retrieval accuracy stopped us from going to production.

The official documentation covers the full technical spec. Here’s a simplified structure:

{
  "model": "claude-sonnet-4-5-20250929",
  "tools": [
    {
      "type": "tool_search_tool_bm25_20251119",
      "name": "tool_search_tool_bm25"
    },
    {
      "name": "send_email",
      "description": "Send email via Gmail",
      "defer_loading": true,
      "input_schema": {...}
    }
  ]
}
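If you build requests in code, the split between always-loaded and deferred tools can be sketched as below. The tool names, descriptions, and empty schemas are hypothetical placeholders standing in for a real catalogue. The search tool entry and the defer_loading flag follow the structure shown above.

```python
import json


def make_tool(name: str, description: str) -> dict:
    """Build a placeholder tool definition (hypothetical, empty schema)."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": {}},
    }


# Hypothetical catalogue for illustration.
CATALOGUE = [
    make_tool("fetch_campaign_data", "Fetch campaign performance data"),
    make_tool("create_task", "Create a task for the team"),
    make_tool("send_notification", "Send a notification"),
    make_tool("send_email", "Send email via Gmail"),
    make_tool("post_slack_message", "Post a message to Slack"),
]

# The most-used tools stay non-deferred so they are always available.
ALWAYS_LOADED = {"fetch_campaign_data", "create_task", "send_notification"}


def build_tools(catalogue: list[dict], always_loaded: set[str]) -> list[dict]:
    """Return the tools array: search tool first, then every tool, deferring the rest."""
    tools = [{"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"}]
    for tool in catalogue:
        entry = dict(tool)  # copy so the catalogue itself is not mutated
        if tool["name"] not in always_loaded:
            entry["defer_loading"] = True
        tools.append(entry)
    return tools


tools = build_tools(CATALOGUE, ALWAYS_LOADED)
print(json.dumps(tools, indent=2))
```

Keeping the always-loaded set as data rather than hard-coding flags makes it easy to change the split as usage patterns shift.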

MCP Integration and the Bigger Picture

Tool Search integrates with Anthropic’s Model Context Protocol (MCP). MCP standardises connections between AI agents and external tools — we think it will become the connective tissue of the marketing stack.

With the mcp-client-2025-11-20 beta header, you can defer loading MCP tools using default_config. This matters most when connecting multiple MCP servers. The vision is AI agents navigating entire martech stacks through natural language, but current retrieval accuracy holds that back.

The Arcade.dev Reality Check

The team at Arcade.dev ran a thorough test, loading 4,027 tools and running 25 straightforward workflows. These weren’t edge cases — they were everyday agentic tasks like “send an email to my colleague” or “post a message to Slack”.

Regex search hit 56% retrieval accuracy (14 out of 25 tasks). BM25 did marginally better at 64% (16 out of 25). Worse, common tools failed basic retrieval: “send email” prompts couldn’t find Gmail_SendEmail, “post a message to Slack” missed Slack_SendMessage, and ticket creation requests failed to surface Zendesk_CreateTicket.

“When ‘send an email’ can’t find Gmail_SendEmail, there’s still work to do.”

Eric Gustin, Arcade.dev

This isn’t about selection or parameterisation accuracy — it’s purely retrieval. Did the correct tool even appear in search results?
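Retrieval accuracy in this sense is simple to measure on your own catalogue. The sketch below counts a case as a hit when the expected tool appears anywhere in the returned list. The test cases are made up for illustration and are not Arcade.dev’s data.

```python
def retrieval_accuracy(cases: list[tuple[str, list[str]]]) -> float:
    """Fraction of cases where the expected tool appears in the search results."""
    hits = sum(1 for expected, returned in cases if expected in returned)
    return hits / len(cases)


# Hypothetical results: (expected tool, tools the search returned).
CASES = [
    ("Gmail_SendEmail", ["Outlook_SendEmail", "Gmail_ListThreads", "Gmail_SendEmail"]),
    ("Slack_SendMessage", ["Slack_ListChannels", "Teams_SendMessage"]),
    ("Zendesk_CreateTicket", ["Zendesk_CreateTicket", "Jira_CreateIssue"]),
    ("HubSpot_GetCampaign", ["HubSpot_GetCampaign"]),
]

print(f"Hypothetical run: {retrieval_accuracy(CASES):.0%}")

# The reported figures are the same ratio over 25 tasks.
print(f"Regex: {14 / 25:.0%}, BM25: {16 / 25:.0%}")
```

Note that this says nothing about whether the model then picks the right tool or fills in its parameters correctly. It only checks that the right tool was on the table.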

Current Limits and Constraints

Anthropic supports up to 10,000 tools in your catalogue, returning three to five relevant tools per search. The feature works with Claude Sonnet 4.5, Sonnet 4.6, Opus 4.5, and Opus 4.6 — no Haiku support. It’s still in public beta and requires the advanced-tool-use-2025-11-20 header.

Tool Search doesn’t work with tool use examples, so teams relying on few-shot prompting will need a workaround. Regex patterns are capped at 200 characters, which means you’ll need to design patterns carefully. Common error codes: invalid_pattern for malformed regex, pattern_too_long for exceeding limits, and too_many_requests for rate limits.
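If you generate regex patterns yourself, a local pre-check can catch the first two errors before a request is sent. This is a sketch under assumptions: it reuses the error code names above as return values, and Python’s own re.compile() may not reject exactly the same patterns as the API does.

```python
import re
from typing import Optional

MAX_PATTERN_LENGTH = 200  # the limit quoted in this article


def precheck_pattern(pattern: str) -> Optional[str]:
    """Return a likely error code for the pattern, or None if it passes local checks."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        return "pattern_too_long"
    try:
        re.compile(pattern)
    except re.error:
        return "invalid_pattern"
    return None


print(precheck_pattern("get_.*_data"))  # valid pattern
print(precheck_pattern("(unclosed"))    # unbalanced parenthesis
print(precheck_pattern("a" * 201))      # over the length limit
```

Rate limits (too_many_requests) can’t be checked locally, so they still need handling on the response side.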

How to Implement Tool Search

Based on our testing, here’s what worked:

Audit your current tool catalogue and usage patterns
Identify the three to five most frequently accessed tools
Keep those tools non-deferred for immediate availability
Rewrite remaining tool descriptions with semantic keywords
Test retrieval accuracy with realistic marketing workflows
Monitor tool discovery logs to identify misses
Iterate on descriptions based on discovery patterns
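The first three steps can be sketched in a few lines if you have a log of tool calls. The log format (one tool name per call) and the tool names here are hypothetical, so adapt them to whatever your own logging records.

```python
from collections import Counter

# Hypothetical usage log: one entry per tool call.
USAGE_LOG = (
    ["fetch_campaign_data"] * 5
    + ["create_task"] * 3
    + ["send_notification"] * 2
    + ["export_report"]
)

# Hypothetical full catalogue, including a tool that was never called.
ALL_TOOLS = [
    "fetch_campaign_data",
    "create_task",
    "send_notification",
    "export_report",
    "update_contact",
]


def split_catalogue(usage_log: list[str], all_tools: list[str], keep: int = 3):
    """Return (always_loaded, deferred) based on how often each tool was called."""
    counts = Counter(usage_log)
    always_loaded = [name for name, _ in counts.most_common(keep)]
    deferred = [name for name in all_tools if name not in always_loaded]
    return always_loaded, deferred


always_loaded, deferred = split_catalogue(USAGE_LOG, ALL_TOOLS)
print("Keep loaded:", always_loaded)
print("Defer:", deferred)
```

The keep parameter maps to the three-to-five guideline above. Re-run the split periodically, since the most-used tools change as campaigns and workflows change.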

When writing tool descriptions, think about how marketers actually describe tasks. Skip the technical jargon — use phrases like “send campaign emails” or “fetch conversion data from analytics”. The BM25 variant rewards clear, natural-language descriptions.

Where This Technology Needs to Go

The architecture makes sense: defer tool loading to avoid context bloat, discover tools just-in-time, keep interactions lightweight. The efficiency gains are real and add up fast at scale. But retrieval accuracy of around 60% (56% for regex and 64% for BM25 in Arcade.dev’s test) isn’t production-ready when agents need to reliably take real-world actions.

“The future of user interaction will not be in the web browser. Traditional software applications will become predominantly headless, backend platforms that provide data and functions to AI agents via standards such as MCP.”

Jensen Huang, President and CEO of NVIDIA

For marketing teams, the promise is still compelling. Imagine “show me last week’s organic traffic to product pages” automatically finding the right PostHog or GA4 tool, fetching data, and formatting results. We’re building towards that at Growth Method, but we’re not there yet.

What I’d Do Now

Tool Search points in the right direction. The token savings are meaningful, and natural language tool discovery would be a step change for marketing teams managing bloated tech stacks.

But when roughly four in ten tool searches fail before you even reach selection and parameterisation, you can’t put it in production. Marketing workflows need “send the campaign report” to find the right email tool every time, not six times out of ten.

For now, stick with traditional tool calling in production. Keep an eye on retrieval accuracy improvements, and get your tool catalogue ready — clear descriptions, semantic keywords, sensible naming. When retrieval gets reliable, the teams with well-structured catalogues will move fastest.