Autoresearch-Charged Action Search and Benchmark for AI Agents
Will Leeney · 8 min read
A few months ago I wrote about building semantic search for StackOne’s 10,000+ agent actions. The approach worked: synonym-enriched embeddings with MiniLM, 84% Hit@5 across 9,340 actions, sub-10ms latency. We shipped it as our Tool Discovery product and customers noticed the improvement.
Except I couldn’t leave it alone.
The system we shipped was good at matching natural language queries to StackOne SaaS connector actions. But I had no idea how it compared to anything else. Were we at 80% of what’s possible, or 30%? The only way to find out was to benchmark properly. And the thing I really wanted was to set the objectives and let something else do the optimisation. I’d been building autonomous research loops for other domains, so it was time to point one at our own product.
AI Agent Tool Retrieval Benchmark Setup
The first step was a proper evaluation harness. Not just our internal test set, but a gauntlet: our mock-connectors dataset (998 tools, 1,843 hard queries), plus three public benchmarks: MetaTool (200 tools, 2,000 queries), ToolBench (10,439 tools, 451 queries), and ToolRet-full (44,453 tools, 7,961 queries).
Then I added competitors. Not toy baselines — actual published methods:
- Semantic Router (Aurelio Labs): embed descriptions as “routes” with centroid matching
- Tool2Vec (Berkeley): synthetic-query embeddings
- LangGraph BigTool: graph-based with an LLM in the loop
- Toolshed RAG: query expansion + cross-encoder reranking
And for good measure, Anthropic’s native tool use — give Claude Sonnet the full tool catalog and let it pick. About $100 in API calls for that one.
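The harness boils down to standard retrieval metrics over ranked candidate lists. Here's a minimal sketch of how Hit@k and MRR might be computed; the function and argument names are illustrative, not StackOne's actual code.

```python
def evaluate(rankings, gold, ks=(1, 5, 10)):
    """rankings: {query_id: [tool_id, ...], ranked best-first}
    gold: {query_id: correct_tool_id}"""
    hits = {k: 0 for k in ks}
    rr_total = 0.0
    for qid, ranked in rankings.items():
        target = gold[qid]
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
        # Reciprocal rank: 1 / position of the gold tool (0 if absent)
        if target in ranked:
            rr_total += 1.0 / (ranked.index(target) + 1)
    n = len(rankings)
    metrics = {f"hit@{k}": hits[k] / n for k in ks}
    metrics["mrr"] = rr_total / n
    return metrics
```

Every strategy in the table below gets scored by the same function, which is what makes the cross-method comparison meaningful.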
Initial Embedding Model Benchmark Results
On our mock-connectors dataset, the initial numbers were humbling:
| Strategy | Hit@1 (global) |
|---|---|
| Semantic Router | 35.4% |
| Tool2Vec | 29.4% |
| StackOne v1 (synonym enrichment) | 27.6% |
| Toolshed RAG | 26.8% |
To be clear, v1 wasn’t bad. It was built for a specific job: matching queries to actions within a known connector. And it did that job well (59% Hit@1 scoped, 84% Hit@5 on our production test set). But in a global search across all tools, the synonym dictionaries and domain terms didn’t help as much as I’d assumed. Semantic Router, which does nothing clever at all (it just embeds raw descriptions with the same MiniLM model), beats it by 8 points.
The off-the-shelf methods all clustered between 27–35%. The LLM-based approaches were slower (~2,500ms per query) and not dramatically better. Every approach had hit the same ceiling.
Fine-Tuning the Embedding Model: Data Beats Architecture
This is where the autoresearch loop earned its keep. The objective was simple: maximise Hit@1 on held-out connectors across all benchmarks, keep inference under 10ms, keep the model small enough to deploy on a Lambda. The loop handled data preparation, training runs, ablation studies, and evaluation.
The insight that mattered was data engineering, not architecture.
Instead of runtime heuristics (synonym dictionaries, category mappings), we built training data that teaches the model the same distinctions:
- Cross-connector hard negatives: same action suffix from different connectors like GitHub or Jira. For example, “github_create_issue” vs “jira_create_issue”. The model has to learn that these are different tools, not just “create issue”
- Same-connector hard negatives: same connector, different action. Forces the model to distinguish between closely related operations
- LLM-generated paraphrases: 3 per instruction. “Make a new employee”, “add a team member”, “onboard a hire”. Exactly what the synonym dictionary was trying to do, but learned rather than hand-coded
- Connector-level holdout: 15% of entire connectors held out for validation, not just queries. The model has to generalise to tools it’s never seen
The split matters. If you hold out queries but the model has seen the tools during training, you’re measuring memorisation, not generalisation.
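The recipe above can be sketched in a few lines. This is an illustrative reconstruction, not the production pipeline: the field names and sampling details are assumptions.

```python
import random

def build_triplets(tools, negatives_per_example=7, seed=0):
    """tools: list of {"connector", "action", "query", "text"} dicts.
    Returns (query, positive_text, negative_text) triplets."""
    rng = random.Random(seed)
    triplets = []
    for t in tools:
        # Cross-connector hard negatives: same action name, different connector
        # (e.g. github_create_issue vs jira_create_issue)
        cross = [o for o in tools
                 if o["action"] == t["action"] and o["connector"] != t["connector"]]
        # Same-connector hard negatives: same connector, different action
        same = [o for o in tools
                if o["connector"] == t["connector"] and o["action"] != t["action"]]
        pool = cross + same
        rng.shuffle(pool)
        for neg in pool[:negatives_per_example]:
            triplets.append((t["query"], t["text"], neg["text"]))
    return triplets

def connector_holdout(tools, frac=0.15, seed=0):
    """Hold out entire connectors for validation, not just queries."""
    connectors = sorted({t["connector"] for t in tools})
    rng = random.Random(seed)
    rng.shuffle(connectors)
    held = set(connectors[:max(1, int(len(connectors) * frac))])
    train = [t for t in tools if t["connector"] not in held]
    val = [t for t in tools if t["connector"] in held]
    return train, val
```

The connector-level split is the part that's easy to get wrong: a per-query split leaks every tool into training, which is exactly the memorisation trap described below.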
What the Loop Found About MiniLM, BGE, and Hard Negatives
Some of it was predictable. More hard negatives helped (we went from 3 to 7 per example). Doubling the text truncation from 1,000 to 2,000 characters preserved documentation context that turned out to be important.
Some of it surprised me. The fine-tuned MiniLM (22M parameters) beat BGE-base (109M parameters) on our internal data. Less capacity with good training data beats raw model size. Random sampling of training data outperformed every clever curriculum strategy the loop tried. And the biggest single lift (45% on ToolRet-full) came from fixing a format mismatch between training and evaluation. The model was embedding structured descriptions during training but raw documentation during eval. Aligning those was almost half the battle.
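The format-alignment fix is worth making concrete. The cure is a single canonical rendering function that both the training pipeline and the eval index call, so the model never sees a train-only text format. The field layout here is an assumption for illustration; only the 2,000-character truncation comes from the post.

```python
def render_tool(connector, action, description, max_chars=2000):
    """Canonical structured description, shared by training and eval.
    Truncation at 2,000 chars matches the doubled limit the loop settled on."""
    text = f"connector: {connector}\naction: {action}\ndescription: {description}"
    return text[:max_chars]
```

If training embeds this structured form but eval embeds raw documentation, the embedding distributions silently diverge, which is the mismatch that cost 45% on ToolRet-full.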
The loop also ran hard negative mining: run the model, find the queries it gets wrong, extract the tools it confused, and feed those back as targeted training signal. This is the kind of thing you wouldn’t bother doing manually. It’s tedious and iterative, which is exactly what an autonomous loop is good at.
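That mining step is simple to express, tedious to run by hand. A hedged sketch, where `retrieve` is a hypothetical ranked-search function standing in for the real index:

```python
def mine_hard_negatives(queries, gold, retrieve, top_k=10):
    """queries: {qid: query text}; gold: {qid: correct tool id}.
    Returns (qid, confused_tool_id) pairs to feed back as training signal."""
    mined = []
    for qid, query in queries.items():
        ranked = retrieve(query, top_k)          # tool ids, best first
        target = gold[qid]
        if ranked and ranked[0] != target:       # the model got this one wrong
            for tool_id in ranked:
                if tool_id == target:
                    break                        # only tools ranked above the gold answer
                mined.append((qid, tool_id))
    return mined
```

Each pass produces exactly the distractors the current model confuses, so retraining on them targets its actual failure modes rather than random negatives.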
v2 Embedding Model Results vs MiniLM, BGE, and 7B Models
The result is what we’re calling v2: a fine-tuned BGE-base model (109M params) trained on combined data from all sources. Here’s how it stacks up.
Mock-Connectors — Scoped (998 tools, connector known in advance)
| Model | Hit@1 | Hit@10 | MRR |
|---|---|---|---|
| v2 (fine-tuned) | 92.8% | 100% | 0.957 |
| v1 (synonyms + MiniLM) | 59.0% | 96.8% | 0.712 |
| BM25 | 36.5% | 91.6% | 0.537 |
Mock-Connectors — Global (all 998 tools)
| Model | Hit@1 | Hit@10 | MRR |
|---|---|---|---|
| v2 (fine-tuned) | 57.3% | 82.3% | 0.661 |
| Anthropic Tool Use | 44.0% | 69.3% | 0.517 |
| v1 (synonyms + MiniLM) | 27.6% | 58.7% | 0.377 |
ToolRet-full (44,453 tools, 7,961 queries)
| Model | nDCG@10 | Params |
|---|---|---|
| v2 (combined-v4) | 0.544 | 109M |
| Qwen3-Embedding-8B | 0.462 | 8,000M |
| NV-Embed-v1 | 0.427 | 7,000M |
| GritLM-7B | 0.411 | 7,000M |
| v1 (synonyms + MiniLM) | ~0.26 | 22M |
MetaTool (200 tools, 2,000 queries)
| Model | Hit@1 | Hit@5 | nDCG@5 |
|---|---|---|---|
| v2 (combined-v4) | 86.9% | 98.6% | 0.938 |
| Anthropic Tool Use | 83.3% | 96.7% | 0.913 |
| v1 (synonyms + MiniLM) | 62.0% | 87.6% | 0.762 |
| BM25 | 23.1% | 52.3% | 0.385 |
ToolBench (10,439 tools, 451 queries)
| Model | Hit@1 | Hit@10 | nDCG@5 |
|---|---|---|---|
| v2 (combined-v4) | 20.6% | 96.0% | 0.482 |
| Ada-002 (paper) | — | — | 0.387 |
| v1 (synonyms + MiniLM) | 11.8% | 51.4% | 0.255 |
| BM25 | 8.2% | 36.4% | 0.174 |
The headline: a 109M parameter model, trained on domain-specific data with proper hard negatives, is #1 on ToolRet-full. It beats Qwen3-8B, NV-Embed-v1, and GritLM-7B, all 60-70x larger. It also beats Anthropic’s native tool use with roughly 30,000x lower latency and near-zero marginal cost. Scoped search hits 92.8% Hit@1 and 100% Hit@10.
Deploying the Fine-Tuned Embedding Model on Modal
A model that wins benchmarks is useless if you can’t retrain and deploy it reliably. The autoresearch loop runs on Modal GPUs, so we needed signing keys to ensure only our CI pipeline can trigger training runs. Datasets live in S3 with versioning, so every training run is traceable to a specific data snapshot. The pipeline formats and validates datasets before training starts, because a malformed example at row 40,000 shouldn’t surface as a mysterious accuracy regression two hours into a run.
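The fail-fast validation pass might look like this. The required fields are assumptions about the dataset schema, not StackOne's actual format; the point is that a bad row is reported by index before training starts.

```python
REQUIRED = ("query", "positive", "negatives")

def validate_dataset(rows):
    """Raise on the first malformed row, with its index, so a bad example
    at row 40,000 surfaces before training rather than two hours in."""
    for i, row in enumerate(rows):
        for field in REQUIRED:
            if field not in row:
                raise ValueError(f"row {i}: missing field '{field}'")
        if not isinstance(row["negatives"], list) or not row["negatives"]:
            raise ValueError(f"row {i}: 'negatives' must be a non-empty list")
        if not row["query"].strip():
            raise ValueError(f"row {i}: empty query")
    return len(rows)
```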
The most important piece: performance gates. The pipeline won’t promote a new model unless it beats the current production model on all four benchmarks. No regressions allowed. If it produces something worse, it just doesn’t ship. That’s what makes low-supervision retraining viable.
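The gate itself is a one-liner once each model's benchmark scores are in hand. A sketch, with the benchmark names from this post and hypothetical metric dicts:

```python
BENCHMARKS = ("mock-connectors", "metatool", "toolbench", "toolret-full")

def should_promote(candidate, production, benchmarks=BENCHMARKS):
    """candidate/production: {benchmark_name: primary_metric}.
    All four benchmarks must improve; a single regression blocks the release."""
    return all(candidate[b] > production[b] for b in benchmarks)
```

Requiring strict improvement everywhere is a deliberately conservative choice: it means an unattended retraining run can never quietly trade one benchmark off against another.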
What I Learned About Fine-Tuning Embedding Models for Tool Retrieval
The synonym dictionaries, the domain enrichment, the category context maps: all the hand-crafted work in v1 added maybe 1-2 points over a vanilla embedding baseline. They felt good to write. The kind of thing an engineer builds in a weekend and ships. But they were solving the wrong problem. The real gains came from the training data. Connector-aware splits, structured hard negatives, failure-mined examples, and cross-domain data mixing.
The autoresearch loop didn’t discover any single brilliant insight. It just ran a lot of experiments I wouldn’t have run manually (ablations on negative sampling, format alignment checks, curriculum experiments that mostly didn’t work) and the compound effect of all those small wins added up to a model that doubles or triples accuracy depending on the benchmark.
The v2 model now powers Tool Discovery across StackOne’s MCP gateway and connector library.
If you have a clear metric and a fast feedback loop, pointing an autonomous research process at your own product is probably the highest-leverage thing you can do. I set the objectives, curated the data sources, and decided when something was good enough to ship. The iteration part is increasingly not my job.