AI Agent Testing: What We Found Testing Thousands of Agent Tools
StackOne’s connectors expose thousands of actions across dozens of SaaS platforms: SAP SuccessFactors, Workday, Klaviyo, Dropbox, and others. Each action has a schema that tells an AI agent what parameters to send and what to expect back.
My job was to find out whether agents could actually use these tools. I built a testing framework to validate connectors at scale, first checking that each API action did what its schema said, then checking that agents could string those actions together to finish real tasks.
The answer, in a lot of cases, was no. And the failures were hard to catch because most of them looked like things were working fine.
This isn’t about using AI to test software. It’s about testing the tools that AI agents depend on.
Testing AI agent tools means validating that the schemas, descriptions, and responses an API exposes work when an autonomous agent, not a human developer, is the consumer. This requires two layers: action-level validation (does each API call do what its schema promises?) and task-level validation (can an agent chain actions to complete real work?). For a map of the 120+ tools in the AI agent ecosystem, see the agentic AI tools landscape.
The Most Common AI Agent Testing Failures Look Like Successes
Early in the validation work, I noticed a pattern. A connector would return results, the status code would be 200, and everything looked normal. But the data was wrong, or missing, or stale.
The most common failures in AI agent testing are false positives: API responses that return HTTP 200 but contain errors the agent cannot detect.
The worst version of this was a provider that returned a 429 “Too Many Requests” error inside the response body, while the HTTP status code stayed 200. In the raw logs, the rate-limit error was right there. But by the time the response reached the agent, it had been normalized to:
```json
{
  "status": 200,
  "data": null
}
```
To the agent, that just meant “no records found.” It moved on to the next step with that assumption. No retry, no backoff, no flag. The request had failed, but nothing in the response told the agent that.
A developer looking at this would check the logs, see the 429, and know what happened. An agent doesn’t check logs. It trusts the response it gets.
This is the class of bug I kept finding. Things that look correct at the interface level but aren’t, and that agents have no way to investigate further.
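One defense is to scan raw provider responses for buried errors before normalization strips them out. Here is a minimal sketch, assuming a dict-shaped response body; the specific error keys and embedded status codes are illustrative assumptions, not a complete list:

```python
# Markers that some providers embed in a "successful" response body.
# These keys and codes are illustrative assumptions, not a complete list.
EMBEDDED_ERROR_KEYS = ("error", "errors", "errorCode")
EMBEDDED_ERROR_CODES = {429, 500, 502, 503}

def looks_like_false_positive(status_code: int, body: dict) -> bool:
    """Flag responses whose HTTP status says success but whose body
    carries an error the agent would otherwise never see."""
    if status_code != 200:
        return False  # a non-200 status is already visible to the caller
    # Case 1: an explicit error object buried in the body
    if any(body.get(key) for key in EMBEDDED_ERROR_KEYS):
        return True
    # Case 2: a body-level status that disagrees with the HTTP layer
    inner = body.get("status")
    if isinstance(inner, int) and inner in EMBEDDED_ERROR_CODES:
        return True
    # Case 3: null data with no explicit "empty result" marker is suspect
    if body.get("data") is None and "total" not in body:
        return True
    return False
```

The key design choice is to run this against the raw provider response, before the normalization layer that collapsed the 429 into `"data": null` in the first place.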
Testing AI Agent Tools: Action-Level Validation
The first layer of the framework validates individual API actions. Does a filter actually filter? Does a pagination limit get respected? Does the response match what the schema promises?
A lot of the issues I found here passed a basic smoke test. You’d have to look closely to see something was off.
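The checks themselves are simple once you go beyond the status code. A minimal sketch of a list-action validator, assuming a generic `call(params) -> records` connector interface (the interface shape is an assumption for illustration):

```python
def validate_list_action(call, filter_field, filter_value, limit=5):
    """Smoke-test a list-style action beyond "did it return 200":
    check that a filter actually filters and a limit is respected.
    `call` is any function(params) -> list of record dicts -- an
    assumed connector interface, for illustration."""
    failures = []
    records = call({filter_field: filter_value, "limit": limit})
    if len(records) > limit:
        failures.append(f"limit={limit} ignored: got {len(records)} records")
    mismatched = [r for r in records if r.get(filter_field) != filter_value]
    if mismatched:
        failures.append(
            f"filter on {filter_field!r} ignored for {len(mismatched)} records"
        )
    return failures
```

An empty return list means the action passed both checks; anything else is a concrete, named failure rather than a silently wrong result.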
When the API Schema Makes the Action Impossible
One action was supposed to create a learning object. It returned:
```
createdRecords: 0
Result: "Failed"
Reason: "MaterialType missing"
```
HTTP 200, of course. But the field it was asking for, MaterialType, didn’t exist anywhere in the schema definition. There was no way to set it. The action could never succeed, and the schema gave no indication of that.
For agents, the schema is the contract. If the contract says “send these fields and you’ll create a record,” the agent sends those fields. When the contract is wrong, the agent fails and has no way to understand why.
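This class of bug can be caught mechanically: cross-check the field a provider error demands against the fields the schema actually exposes. A sketch, assuming the `"<Field> missing"` error shape seen above (the error format is an assumption for illustration):

```python
import re

def unsettable_fields_in_error(error_reason: str, schema_properties: dict) -> list:
    """Given a provider error like "MaterialType missing" and the action's
    schema properties, return any field the error demands that the schema
    never exposes -- i.e. a contract the agent can never satisfy.
    The "<Field> missing" error shape is an assumption for illustration."""
    demanded = re.findall(r"(\w+) missing", error_reason)
    return [field for field in demanded if field not in schema_properties]
```

Any non-empty result marks an action that cannot succeed no matter what the agent sends.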
Testing AI Agent Behavior: Task-Level Failures
Even when every individual action works correctly, agents can still fail at the task level. They have to interpret what the user wants, pick the right tool, fill in the parameters, and sometimes chain several actions together. This is where a different set of problems showed up.
Skipping Steps: When Agents Miss Required Lookups
I gave an agent the prompt: “Show all deductions for employee ‘John Doe’.”
That requires two steps: resolve the name to an employee ID, then use that ID to pull deduction data. The agent skipped the first step. It passed “John Doe” directly as the employee ID parameter.
The API returned an empty result set. From the agent’s perspective, John Doe just didn’t have any deductions. In reality, the lookup never happened. The agent gave a confident, wrong answer.
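Catching this requires inspecting the agent's tool-call trace, not just its final answer. A sketch, assuming the trace is captured as a list of `(tool_name, params)` tuples (an assumed capture format) and using a crude space-in-ID heuristic:

```python
def check_lookup_before_use(trace, lookup_tool, dependent_tool, id_param):
    """Inspect an agent's tool-call trace -- a list of (tool_name, params)
    tuples, an assumed shape -- and flag two task-level failures: the
    lookup step being skipped, and a raw name passed as an ID."""
    issues = []
    tools_called = [name for name, _ in trace]
    for i, (name, params) in enumerate(trace):
        if name != dependent_tool:
            continue
        if lookup_tool not in tools_called[:i]:
            issues.append(f"{dependent_tool} called without prior {lookup_tool}")
        value = str(params.get(id_param, ""))
        if " " in value:  # crude heuristic: IDs rarely contain spaces
            issues.append(f"{id_param}={value!r} looks like a name, not an ID")
    return issues
```

The point is that the final answer alone ("no deductions found") is indistinguishable from a correct run; only the trace reveals the skipped step.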
Picking the Wrong Tool from Similar Options
Some providers expose tools with similar names, like search_positions and search_position_budgets. When I asked for budget data, one model picked the wrong tool entirely. Another picked the right tool but then made extra metadata calls to double-check the result, which actually worked out.
The difference between the two wasn’t the API. It was how the models read the tool descriptions. When I improved the descriptions to be more specific about what each tool returned, the wrong-tool problem mostly went away. Improving tool descriptions had a larger impact on agent accuracy than improving agent logic itself.
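To make the contrast concrete, here is a hypothetical before/after for the two tools. The wording is entirely illustrative, not StackOne's actual schemas; the pattern is to state what each tool returns, what it does not return, and when to prefer its sibling:

```python
# Hypothetical tool descriptions -- illustrative wording, not real schemas.
vague = {
    "search_positions": "Search positions.",
    "search_position_budgets": "Search position budgets.",
}

specific = {
    "search_positions": (
        "Search job positions by title, department, or status. "
        "Returns position records only; contains no budget or cost data."
    ),
    "search_position_budgets": (
        "Search budget allocations for positions. Use this, not "
        "search_positions, when the user asks about budgets, costs, or "
        "spend. Returns amounts per position and fiscal period."
    ),
}
```

The specific versions disambiguate by exclusion ("contains no budget or cost data") and by explicit routing ("use this, not search_positions"), which is what resolved the wrong-tool picks in practice.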
Same Prompt, Different Input Formatting Across Models
I asked two different models: “Get detailed information about job code ‘JOB 34553’.”
One model passed JOB34553 to the API, stripping the space. The other preserved the input exactly and got the right result. Same prompt, same tools, different interpretation of how to format the input before making the call.
These kinds of differences are small individually. But when an agent is chaining five or six actions together, each one has a chance to introduce this kind of drift.
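Drift like this is cheap to detect if you compare the captured tool calls against the literal quoted in the prompt. A sketch, assuming tool-call parameters are captured as a list of dicts (an assumed capture format):

```python
def check_literal_preserved(prompt_literal: str, tool_calls: list) -> list:
    """Check that an identifier quoted in the prompt (e.g. "JOB 34553")
    reaches the API verbatim. `tool_calls` is a list of parameter dicts
    captured from the agent run -- an assumed capture format."""
    issues = []
    for params in tool_calls:
        for key, value in params.items():
            sent = str(value)
            # Same identifier modulo whitespace, but not byte-identical:
            # the model silently reformatted the input.
            if (sent.replace(" ", "") == prompt_literal.replace(" ", "")
                    and sent != prompt_literal):
                issues.append(f"{key}: sent {sent!r}, prompt said {prompt_literal!r}")
    return issues
```

Run the same check across every model under test and the reformatting models identify themselves immediately.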
What AI Agent Testing Revealed
After running these validations across a large number of connectors, a few things became clear.
False positives are the most common failure. A 200 status code doesn’t mean the request did what you asked. It means the server didn’t crash. Those are different things, and agents treat them as the same.
Schema accuracy matters more than you’d expect. If the schema says a field exists and it doesn’t, or says a field is optional and it’s actually required, the agent will follow the schema and fail. It won’t try variations or read error messages the way a developer would.
Tool descriptions shape model behavior. This surprised me. The wording of a description changed which tool a model picked, how it formatted parameters, and whether it added extra validation steps. Small wording changes had outsized effects.
Different models fail differently. We ran the same test prompts through multiple models and got different failure modes. One model would skip steps, another would pick the wrong tool, another would reformat inputs. You can’t validate an agent tool against just one model and call it done.
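In practice this means running every validation task against every model and keeping the failures per model. A sketch of that matrix, where `run_task(model, task) -> list of issue strings` is an assumed harness interface, not a real library call:

```python
def run_validation_matrix(tasks, models, run_task):
    """Run every task prompt against every model and collect failure
    modes per model. `run_task(model, task) -> list of issue strings`
    is an assumed harness interface, not a real library call."""
    failures = {}
    for model in models:
        for task in tasks:
            issues = run_task(model, task)
            if issues:
                failures.setdefault(model, {})[task] = issues
    return failures
```

A model absent from the result passed every task; a model present shows exactly which tasks it failed and how, which is what makes the per-model failure modes visible at all.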
What I found wasn’t a list of isolated bugs. It was a pattern: small inconsistencies that a human developer would work around in thirty seconds become real failure points when an agent hits them. Schemas that look clear to a person can still be ambiguous to a model. Responses that appear successful can produce wrong outcomes.
At StackOne, this is why we test every connector action before agents touch it. Once agents are in the loop, small inconsistencies don’t stay small. They compound across steps, across systems, and across decisions.