Will Leeney · 8 min read
Autoresearch-Charged Action Search and Benchmark for AI Agents

A few months ago I wrote about building semantic search for StackOne’s 10,000+ agent actions. The approach worked: synonym-enriched embeddings with MiniLM, 84% Hit@5 across 9,340 actions, sub-10ms latency. We shipped it as our Tool Discovery product and customers noticed the improvement.

Except I couldn’t leave it alone.

The system we shipped was good at matching natural language queries to StackOne SaaS connector actions. But I had no idea how it compared to anything else. Were we at 80% of what’s possible, or 30%? The only way to find out was to benchmark properly. And the thing I really wanted was to set the objectives and let something else do the optimisation. I’d been building autonomous research loops for other domains, so it was time to point one at our own product.

AI Agent Tool Retrieval Benchmark Setup

The first step was a proper evaluation harness. Not just our internal test set, but a gauntlet: our mock-connectors dataset (998 tools, 1,843 hard queries), plus three public benchmarks: MetaTool (200 tools, 2,000 queries), ToolBench (10,439 tools, 451 queries), and ToolRet-full (44,453 tools, 7,961 queries).
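For reference, the retrieval metrics reported throughout (Hit@k and MRR) only take a few lines to compute. Below is a minimal sketch, not the actual harness: the `retrieve` callable and the (query, gold tool) dataset shape are assumptions.

```python
# Minimal sketch of the retrieval metrics used in this post (Hit@k, MRR).
# `retrieve(query, k)` is assumed to return a ranked list of tool names;
# `benchmark` is a list of (query, gold_tool) pairs. Illustrative only.
def evaluate(benchmark, retrieve, ks=(1, 5, 10)):
    hits = {k: 0 for k in ks}
    reciprocal_ranks = []
    for query, gold_tool in benchmark:
        ranked = retrieve(query, k=max(ks))
        for k in ks:
            hits[k] += gold_tool in ranked[:k]          # Hit@k: gold tool in top k
        rank = ranked.index(gold_tool) + 1 if gold_tool in ranked else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(benchmark)
    return {f"Hit@{k}": hits[k] / n for k in ks} | {"MRR": sum(reciprocal_ranks) / n}
```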

Then I added competitors. Not toy baselines, but actual published methods: Semantic Router, Tool2Vec, and Toolshed RAG.

And for good measure, Anthropic’s native tool use — give Claude Sonnet the full tool catalog and let it pick. About $100 in API calls for that one.

Initial Embedding Model Benchmark Results

On our mock-connectors dataset, the initial numbers were humbling:

Strategy                           Hit@1 (global)
Semantic Router                    35.4%
Tool2Vec                           29.4%
StackOne v1 (synonym enrichment)   27.6%
Toolshed RAG                       26.8%

To be clear, v1 wasn’t bad. It was built for a specific job: matching queries to actions within a known connector. And it did that job well (59% Hit@1 scoped, 84% Hit@5 on our production test set). But in a global search across all tools, the synonym dictionaries and domain terms didn’t help as much as I’d assumed. Semantic Router, which does nothing clever at all (it just embeds raw descriptions with the same MiniLM model), beats it by roughly 8 points.
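To make the scoped/global distinction concrete, here is a rough sketch of the two search modes, assuming precomputed tool embeddings and a `connector` field on each tool record (both illustrative, not the production code).

```python
# Sketch of scoped vs global retrieval. Assumes `tool_vecs` is a matrix of
# precomputed tool embeddings and each entry in `tools` has "name" and
# "connector" fields. Names are illustrative.
import numpy as np

def search(query_vec, tool_vecs, tools, k=5, connector=None):
    if connector is not None:
        # Scoped: restrict candidates to one known connector before ranking.
        idx = [i for i, t in enumerate(tools) if t["connector"] == connector]
    else:
        # Global: rank across the entire catalog.
        idx = list(range(len(tools)))
    scores = tool_vecs[idx] @ query_vec
    top = np.argsort(-scores)[:k]
    return [tools[idx[i]]["name"] for i in top]
```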

The off-the-shelf methods all clustered between 27–35%. The LLM-based approaches were slower (~2,500ms per query) and not dramatically better. Every approach had hit the same ceiling.

Fine-Tuning the Embedding Model: Data Beats Architecture

This is where the autoresearch loop earned its keep. The objective was simple: maximise Hit@1 on held-out connectors across all benchmarks, keep inference under 10ms, keep the model small enough to deploy on a Lambda. The loop handled data preparation, training runs, ablation studies, and evaluation.
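As a rough illustration, the brief given to the loop boils down to one objective plus hard constraints, something like the sketch below. The field names and the size cap are assumptions; the held-out Hit@1 objective, the 10ms budget, and the Lambda requirement come from the text above.

```python
# Illustrative encoding of the loop's brief. Field names and the size cap
# are assumptions, not the actual configuration.
OBJECTIVE = "hit_at_1_on_held_out_connectors"   # averaged across all four benchmarks

CONSTRAINTS = {
    "p50_inference_ms": 10.0,    # retrieval latency must stay under ~10ms
    "model_size_mb": 250.0,      # assumed cap: small enough to deploy on a Lambda
}

def passes_constraints(run_metrics: dict) -> bool:
    # A candidate model is only considered if it respects every constraint.
    return all(run_metrics[name] <= limit for name, limit in CONSTRAINTS.items())
```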

The insight that mattered was data engineering, not architecture.

Instead of runtime heuristics (synonym dictionaries, category mappings), we built training data that teaches the model the same distinctions:

  • Cross-connector hard negatives: same action suffix from different connectors like GitHub or Jira. For example, “github_create_issue” vs “jira_create_issue”. The model has to learn that these are different tools, not just “create issue”
  • Same-connector hard negatives: same connector, different action. Forces the model to distinguish between closely related operations
  • LLM-generated paraphrases: 3 per instruction. “Make a new employee”, “add a team member”, “onboard a hire”. Exactly what the synonym dictionary was trying to do, but learned rather than hand-coded
  • Connector-level holdout: 15% of entire connectors held out for validation, not just queries. The model has to generalise to tools it’s never seen

The split matters. If you hold out queries but the model has seen the tools during training, you’re measuring memorisation, not generalisation.
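To make that concrete, here is a minimal sketch of how cross-connector and same-connector hard negatives might be turned into training triplets for a small model with sentence-transformers. The example tools, queries, `build_examples` helper, and hyperparameters are illustrative assumptions, not the actual pipeline.

```python
# Hypothetical sketch: assembling hard-negative triplets and fine-tuning a
# small embedding model with sentence-transformers. Illustrative only.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

tools = {
    "github_create_issue": "Create a new issue in a GitHub repository.",
    "jira_create_issue":   "Create a new issue in a Jira project.",
    "github_close_issue":  "Close an existing issue in a GitHub repository.",
}

def build_examples(query, positive, negatives):
    # One (query, positive, hard-negative) triplet per negative tool.
    return [InputExample(texts=[query, tools[positive], tools[n]]) for n in negatives]

train_examples = (
    # Cross-connector hard negative: same action suffix, different connector.
    build_examples("open a bug report on our GitHub repo",
                   positive="github_create_issue",
                   negatives=["jira_create_issue"])
    # Same-connector hard negative: same connector, different action.
    + build_examples("log a new issue in GitHub",
                     positive="github_create_issue",
                     negatives=["github_close_issue"])
)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
# MultipleNegativesRankingLoss treats in-batch positives as negatives too,
# so the explicit third text acts as the mined hard negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```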

What the Loop Found About MiniLM, BGE, and Hard Negatives

Some of it was predictable. More hard negatives helped (we went from 3 to 7 per example). Doubling the text truncation from 1,000 to 2,000 characters preserved documentation context that turned out to be important.

Some of it surprised me. The fine-tuned MiniLM (22M parameters) beat BGE-base (109M parameters) on our internal data. Less capacity with good training data beats raw model size. Random sampling of training data outperformed every clever curriculum strategy the loop tried. And the biggest single lift (45% on ToolRet-full) came from fixing a format mismatch between training and evaluation. The model was embedding structured descriptions during training but raw documentation during eval. Aligning those was almost half the battle.
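The fix for that mismatch is conceptually simple: one canonical function builds the text that gets embedded, and both the training pipeline and the evaluation harness call it. A hedged sketch, with assumed field names:

```python
# Sketch of training/eval format alignment: a single canonical text builder
# used by both paths. Field names are illustrative assumptions.
def tool_to_text(tool: dict, max_chars: int = 2000) -> str:
    parts = [
        f"connector: {tool['connector']}",
        f"action: {tool['name']}",
        f"description: {tool.get('description', '')}",
    ]
    # Same structure and same 2,000-character truncation everywhere.
    return " | ".join(parts)[:max_chars]
```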

The loop also ran hard negative mining: run the model, find the queries it gets wrong, extract the tools it confused, and feed those back as targeted training signal. This is the kind of thing you wouldn’t bother doing manually. It’s tedious and iterative, which is exactly what an autonomous loop is good at.
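A rough sketch of what that mining step might look like, assuming a sentence-transformers-style model and an evaluation set of (query, correct tool) pairs; the names here are illustrative, not the loop's actual code.

```python
# Hypothetical failure-driven hard-negative mining: re-run retrieval, keep
# the queries the model gets wrong, and return the tools it confused so they
# can be fed back as targeted negatives.
import numpy as np

def mine_hard_negatives(model, eval_set, tool_names, tool_texts, top_k=5):
    tool_vecs = model.encode(tool_texts, normalize_embeddings=True)
    mined = []
    for query, correct_tool in eval_set:
        q = model.encode(query, normalize_embeddings=True)
        ranked = [tool_names[i] for i in np.argsort(-(tool_vecs @ q))[:top_k]]
        if ranked[0] != correct_tool:
            # The tools confused with the answer become new hard negatives.
            mined.append((query, correct_tool, [t for t in ranked if t != correct_tool]))
    return mined
```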

v2 Embedding Model Results vs MiniLM, BGE, and 7B Models

The result is what we’re calling v2: a fine-tuned BGE-base model (109M params) trained on combined data from all sources. Here’s how it stacks up.

Mock-Connectors — Scoped (998 tools, connector known)

Model                    Hit@1    Hit@10   MRR
v2 (fine-tuned)          92.8%    100%     0.957
v1 (synonyms + MiniLM)   59.0%    96.8%    0.712
BM25                     36.5%    91.6%    0.537

Mock-Connectors — Global (all 998 tools)

Model                    Hit@1    Hit@10   MRR
v2 (fine-tuned)          57.3%    82.3%    0.661
Anthropic Tool Use       44.0%    69.3%    0.517
v1 (synonyms + MiniLM)   27.6%    58.7%    0.377

ToolRet-full (44,453 tools, 7,961 queries)

Model                    nDCG@10   Params
v2 (combined-v4)         0.544     109M
Qwen3-Embedding-8B       0.462     8,000M
NV-Embed-v1              0.427     7,000M
GritLM-7B                0.411     7,000M
v1 (synonyms + MiniLM)   ~0.26     22M

MetaTool (200 tools, 2,000 queries)

Model                    Hit@1    Hit@5    nDCG@5
v2 (combined-v4)         86.9%    98.6%    0.938
Anthropic Tool Use       83.3%    96.7%    0.913
v1 (synonyms + MiniLM)   62.0%    87.6%    0.762
BM25                     23.1%    52.3%    0.385

ToolBench (10,439 tools, 451 queries)

Model                    Hit@1    Hit@10   nDCG@5
v2 (combined-v4)         20.6%    96.0%    0.482
Ada-002 (paper)          –        –        0.387
v1 (synonyms + MiniLM)   11.8%    51.4%    0.255
BM25                     8.2%     36.4%    0.174

The headline: a 109M parameter model, trained on domain-specific data with proper hard negatives, is #1 on ToolRet-full. It beats Qwen3-8B, NV-Embed-v1, and GritLM-7B, all 60-70x larger. It also beats Anthropic’s native tool use with roughly 30,000x lower latency and near-zero marginal cost. Scoped search hits 92.8% Hit@1 and 100% Hit@10.

Deploying the Fine-Tuned Embedding Model on Modal

A model that wins benchmarks is useless if you can’t retrain and deploy it reliably. The autoresearch loop runs on Modal GPUs, so we needed signing keys to ensure only our CI pipeline can trigger training runs. Datasets live in S3 with versioning, so every training run is traceable to a specific data snapshot. The pipeline formats and validates datasets before training starts, because a malformed example at row 40,000 shouldn’t surface as a mysterious accuracy regression two hours into a run.

The most important piece: performance gates. The pipeline won’t promote a new model unless it beats the current production model on all four benchmarks. No regressions allowed. If it produces something worse, it just doesn’t ship. That’s what makes low-supervision retraining viable.
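As a sketch, the gate reduces to a strict comparison against production on every benchmark. The benchmark names and score format below are illustrative assumptions, not the actual pipeline code.

```python
# Hypothetical promotion gate: the candidate must beat the current production
# model on all four benchmarks, otherwise it does not ship.
BENCHMARKS = ["mock-connectors", "MetaTool", "ToolBench", "ToolRet-full"]

def should_promote(candidate_scores: dict, production_scores: dict) -> bool:
    # No regressions allowed: strictly better on every benchmark.
    return all(candidate_scores[b] > production_scores[b] for b in BENCHMARKS)
```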

What I Learned About Fine-Tuning Embedding Models for Tool Retrieval

The synonym dictionaries, the domain enrichment, the category context maps: all the hand-crafted work in v1 added maybe 1-2 points over a vanilla embedding baseline. They felt good to write. The kind of thing an engineer builds in a weekend and ships. But they were solving the wrong problem. The real gains came from the training data: connector-aware splits, structured hard negatives, failure-mined examples, and cross-domain data mixing.

The autoresearch loop didn’t discover any single brilliant insight. It just ran a lot of experiments I wouldn’t have run manually (ablations on negative sampling, format alignment checks, curriculum experiments that mostly didn’t work) and the compound effect of all those small wins added up to a model that doubles or triples accuracy depending on the benchmark.

The v2 model now powers Tool Discovery across StackOne’s MCP gateway and connector library.

If you have a clear metric and a fast feedback loop, pointing an autonomous research process at your own product is probably the highest-leverage thing you can do. I set the objectives, curated the data sources, and decided when something was good enough to ship. The iteration part is increasingly not my job.

Frequently Asked Questions

What is AI agent action search?
AI agent action search is the task of retrieving the correct API action from a catalog of thousands of tools, so the agent can execute the right one for a given query. It's retrieval tuned for tool selection across many connectors. The ToolRet benchmark tests this across more than 44,000 tools.
How do you fine-tune a small embedding model to beat larger ones for tool retrieval?
Fine-tuning a small embedding model to beat larger ones for tool retrieval requires three things: cross-connector hard negatives (same action suffix from different connectors as training pairs), LLM-generated paraphrases of each instruction, and connector-level holdouts so the model generalises to unseen tools. A 109M BGE-base fine-tuned this way scored #1 on ToolRet-full in our benchmarks, beating Qwen3-8B and NV-Embed-v1.
Why does data engineering beat architecture for embedding model fine-tuning?
Data engineering beats architecture because most embedding models share similar transformer backbones. The lift comes from training data: hard negatives between similar API actions, paraphrases that close the user-vocabulary gap, and connector-level splits that test generalisation. In our experiments, the biggest single lift (45%) came from fixing a format mismatch between training and evaluation, not from changing the model.
What are hard negatives in embedding model training?
Hard negatives in embedding model training are examples that look similar to the correct answer but are wrong, forcing the model to learn fine-grained distinctions. For tool retrieval, that means same-action-suffix-different-connector pairs and same-connector-different-action pairs. The MetaTool benchmark shows LLMs struggle to select tools reliably without this kind of targeted signal.
How does a fine-tuned 109M embedding model compare to using an LLM directly for tool selection?
A fine-tuned 109M embedding model beat Anthropic's native tool use at roughly 30,000x lower latency in our benchmarks (57.3% vs 44.0% Hit@1 on global mock-connectors). Stanford's 2025 AI Index shows inference cost dropping 280x in 18 months, but per-query economics still favor specialized retrievers for high-volume AI agent action selection.
What problems do AI agents face with tool retrieval at enterprise scale?
AI agents at enterprise scale face a tool retrieval problem: with thousands of available tools, picking the right one is harder than executing it. Gartner forecasts that 40% of enterprise apps will feature task-specific agents by 2026, raising the stakes. The AI agent action search model behind StackOne's MCP gateway handles this retrieval problem across our connector library.
Which embedding model is best for RAG?
The best embedding model for RAG depends on the corpus and task. BGE and MiniLM lead general semantic search benchmarks. For specialized tasks like tool retrieval, fine-tuning a small embedding model on domain-specific data with hard negatives outperforms larger general models. Our 109M BGE-base fine-tune beat Qwen3-Embedding-8B and NV-Embed-v1, both roughly 60-70x larger.
What's a good benchmark for AI agent tool retrieval?
Good benchmarks for AI agent tool retrieval include the Berkeley Function Calling Leaderboard, ToolBench (16,464 RESTful APIs across 49 categories), MetaTool, and ToolRet. Each measures a different dimension: function-call accuracy, scale, when-to-call decisions, and retrieval across more than 44,000 tools. We evaluate on ToolRet-full, MetaTool, and ToolBench.
