Search and Execute: Semantic Tool Discovery

As agent tool catalogs grow, loading every available action into context becomes impractical. Search and Execute solves this with two meta-tools: tool_search and tool_execute. Instead of loading hundreds of definitions upfront, the agent searches for what it needs and calls it.

This release upgrades the search layer from lexical matching to a fine-tuned semantic model, building on Agent Tools Discovery. In practice, a query like “enroll new hire in training” now resolves to workday_create_learning_enrollment even though the query and the tool name share no exact keywords. The model was trained specifically on SaaS actions, so it handles the vocabulary gap between how people describe tasks and how APIs name them.

On accuracy, the model reaches 92.8% Hit@1 on scoped search and ranks #1 on ToolRet-full (a public benchmark of 44,453 tools), ahead of Qwen3-Embedding-8B and NV-Embed-v1, models up to 73x larger. On global search across all connectors, it reaches 57.3% Hit@1 versus 44.0% for Anthropic’s native tool use.

What’s New

Fine-tuned semantic model: BGE-base-en-v1.5 (109M params), trained on SaaS actions with cross-connector hard negatives and LLM-generated query paraphrases. Replaces the previous lexical approach and generalises to connectors it hasn’t seen in training
Three search modes: semantic (API-backed, highest accuracy), local (in-process BM25+TF-IDF, no network calls), and auto (semantic with local fallback). Available across the TypeScript and Python SDKs, the live MCP, and the StackOne playground
Available via MCP: Add ?tool-mode=search_execute to the MCP URL. Works with any MCP client without custom headers. Omit the parameter and the server returns the full tool list as before
tool_search and tool_execute: The two meta-tools, renamed from meta_search_tools and meta_execute_tool. tool_search accepts query, topK, and minScore; tool_execute resolves account IDs from the connector type automatically and supports multi-execute arrays in a single call
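To make the auto mode and the topK and minScore parameters concrete, here is a minimal Python sketch of "semantic first, local fallback" followed by score filtering. Everything in it is an assumption for illustration: the function names (auto_search, filter_results, local_overlap_search), the tool names in the catalog, and the choice to fall back on network errors are hypothetical, and the word-overlap scorer is a toy stand-in for the BM25+TF-IDF local mode, not the SDK's implementation.

```python
from typing import Callable, Dict, List, Tuple

Scored = List[Tuple[str, float]]  # (tool_name, score) pairs

# Hypothetical catalog: tool name -> description.
CATALOG: Dict[str, str] = {
    "hris_create_time_off": "create a time off request for an employee",
    "hris_list_employees": "list all employees in the company",
    "ats_create_candidate": "create a candidate in the applicant tracking system",
}


def local_overlap_search(query: str, catalog: Dict[str, str]) -> Scored:
    """Toy lexical scorer: fraction of query words found in each description."""
    query_words = set(query.lower().split())
    results = []
    for name, description in catalog.items():
        shared = query_words & set(description.lower().split())
        score = len(shared) / len(query_words) if query_words else 0.0
        results.append((name, score))
    return results


def filter_results(results: Scored, top_k: int, min_score: float) -> Scored:
    """Drop results below min_score, sort best first, keep at most top_k."""
    kept = [item for item in results if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


def auto_search(
    query: str,
    semantic: Callable[[str], Scored],
    local: Callable[[str], Scored],
    top_k: int = 5,
    min_score: float = 0.0,
) -> Scored:
    """Try the semantic backend first; fall back to local search on network errors."""
    try:
        results = semantic(query)
    except OSError:  # ConnectionError and TimeoutError are OSError subclasses
        results = local(query)
    return filter_results(results, top_k, min_score)


def unreachable_semantic(query: str) -> Scored:
    """Simulates the semantic API being unavailable."""
    raise ConnectionError("semantic search API unreachable")


matches = auto_search(
    "create time off request",
    semantic=unreachable_semantic,
    local=lambda q: local_overlap_search(q, CATALOG),
    top_k=5,
    min_score=0.2,
)
print(matches)
```

In this sketch the unrelated hris_list_employees tool scores 0 and is dropped by min_score, which is the practical reason to set a threshold: the agent sees fewer irrelevant candidates.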

Setup

Search and Execute is available via MCP using tool-mode=search_execute, or through the TypeScript and Python SDKs.

MCP

https://api.stackone.com/mcp?tool-mode=search_execute&x-account-id=your_account_id
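If the URL is assembled in code rather than pasted into a client config, a small helper keeps the query parameters explicit. This is a sketch using only the Python standard library; the helper name build_mcp_url and its search_execute flag are hypothetical, while the base URL and the two parameter names come from the URL above.

```python
from urllib.parse import urlencode

MCP_BASE_URL = "https://api.stackone.com/mcp"


def build_mcp_url(account_id: str, search_execute: bool = True) -> str:
    """Build the MCP URL; leave out tool-mode to get the full tool list."""
    params = {}
    if search_execute:
        params["tool-mode"] = "search_execute"
    params["x-account-id"] = account_id
    return f"{MCP_BASE_URL}?{urlencode(params)}"


print(build_mcp_url("your_account_id"))
```

Calling build_mcp_url(account_id, search_execute=False) omits tool-mode, which corresponds to the default behaviour of returning the full tool list.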

TypeScript

npm install @stackone/ai

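import { StackOneToolSet } from "@stackone/ai";

const toolset = new StackOneToolSet();

// Search for tools
const tools = await toolset.searchTools("manage employee time off", {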
  topK: 5,
  search: "auto",
});
const openAITools = tools.toOpenAI();

// Or use meta-tools in an agent loop
const { tool_search, tool_execute } = await toolset.getTools();

Python

pip install stackone-ai

from stackone_ai import StackOneToolSet

toolset = StackOneToolSet()

# Search for tools
tools = toolset.search_tools("manage employee time off", top_k=5)
openai_tools = tools.to_openai()

# Or use meta-tools in an agent loop
meta_tools = toolset.openai(mode="search_and_execute")
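For readers who want to see what the agent loop does with the two meta-tools, the following self-contained sketch routes model-issued tool calls to tool_search and tool_execute. It does not use the SDK: the catalog, the handlers, the word-count search, and the tool_name and parameters argument names of tool_execute are all hypothetical stand-ins. The point is only that the model needs two tool definitions in context, whatever the size of the catalog behind them.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in catalog: tool name -> (description, handler).
CATALOG: Dict[str, Tuple[str, Callable[[Dict[str, Any]], Any]]] = {
    "hris_list_employees": ("list employees", lambda p: ["Ada", "Grace"]),
    "hris_create_time_off": (
        "create time off request",
        lambda p: {"status": "created", **p},
    ),
}


def tool_search(query: str, topK: int = 5) -> List[Dict[str, str]]:
    """Toy search: rank tools by the number of words shared with the query."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(description.split())), name, description)
        for name, (description, _) in CATALOG.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        {"name": name, "description": description}
        for score, name, description in scored[:topK]
        if score > 0
    ]


def tool_execute(tool_name: str, parameters: Dict[str, Any]) -> Any:
    """Run the named tool's handler with the given parameters."""
    if tool_name not in CATALOG:
        raise KeyError(f"unknown tool: {tool_name}")
    _, handler = CATALOG[tool_name]
    return handler(parameters)


# The only two tools exposed to the model.
META_TOOLS: Dict[str, Callable[..., Any]] = {
    "tool_search": tool_search,
    "tool_execute": tool_execute,
}


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Route one model-issued call to a meta-tool and return a JSON result."""
    result = META_TOOLS[name](**json.loads(arguments_json))
    return json.dumps(result)


# Simulated agent turn: search first, then execute the top hit.
found = json.loads(
    handle_tool_call("tool_search", '{"query": "request time off", "topK": 1}')
)
call = {"tool_name": found[0]["name"], "parameters": {"days": 2}}
outcome = json.loads(handle_tool_call("tool_execute", json.dumps(call)))
print(found[0]["name"], outcome)
```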

Benchmark

We benchmarked the new fine-tuned model (v2) against the previous synonym-enrichment approach (v1), BM25, and Anthropic’s native tool use. We also evaluated it on ToolRet-full, a public benchmark of 44,453 tools.

Scoped search: 998 tools, 1,843 queries (connector known)

Model	Hit@1	Hit@10	MRR
v2 (fine-tuned BGE-base)	92.8%	100%	0.957
v1 (synonym enrichment)	59.0%	96.8%	0.712
BM25	36.5%	91.6%	0.537

Global search: 998 tools, 1,843 queries (no connector hint)

Model	Hit@1	Hit@10	MRR
v2 (fine-tuned BGE-base)	57.3%	82.3%	0.661
Anthropic Tool Use (Claude Sonnet)	44.0%	69.3%	0.517
v1 (synonym enrichment)	27.6%	58.7%	0.377
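For reference, the Hit@k and MRR columns in the two tables above follow the standard definitions, sketched below in Python. The sketch assumes each query has exactly one correct tool, which is an assumption about the benchmark setup rather than something stated here; the function names and the sample data are illustrative only.

```python
from typing import Dict, List


def hit_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Share of queries whose correct tool appears in the top k results."""
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Average of 1/rank of the correct tool; a miss contributes 0."""
    total = 0.0
    for query, ranked in rankings.items():
        if gold[query] in ranked:
            total += 1.0 / (ranked.index(gold[query]) + 1)
    return total / len(rankings)


# Illustrative data: three queries, ranked tool names per query.
rankings = {"q1": ["a", "b"], "q2": ["b", "a"], "q3": ["c", "d"]}
gold = {"q1": "a", "q2": "a", "q3": "x"}

print(hit_at_k(rankings, gold, 1), hit_at_k(rankings, gold, 2))
print(mean_reciprocal_rank(rankings, gold))
```

Hit@1 rewards only a correct top result, while MRR gives partial credit for a correct tool at rank 2 or lower, which is why MRR sits between Hit@1 and Hit@10 in both tables.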

ToolRet-full: 44,453 tools, 7,961 queries

Model	nDCG@10	Params
v2 (fine-tuned BGE-base)	0.544	109M
Qwen3-Embedding-8B	0.462	8,000M
NV-Embed-v1	0.427	7,000M
GritLM-7B	0.411	7,000M
v1 (synonym enrichment)	~0.26	22M
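The nDCG@10 column can be read with one common definition of the metric in mind, sketched below. This uses linear gain with a log2 rank discount; the exact variant used by the ToolRet evaluation is not stated here, so treat the choice as an assumption. Function names and sample data are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain: relevance divided by log2(rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(tool, 0.0) for tool in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant tool "a": ranked first, then ranked second.
perfect = ndcg_at_k(["a", "b", "c"], {"a": 1.0})
second = ndcg_at_k(["b", "a", "c"], {"a": 1.0})
print(perfect, second)
```

Unlike Hit@1, nDCG handles queries with several relevant tools and graded relevance, which suits a benchmark that aggregates many heterogeneous tool-retrieval datasets.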

For the methodology behind the v2 model, read How a 109M Embedding Model Beats 8B on Tool Retrieval.