Published: August 20, 2025 · 4 min read
Written by Will Leeney, AI Engineer

Building an Eval-Driven AI Feature at StackOne

StackOne connects you to hundreds of providers through one API. But things go wrong: expired keys, missing scopes, unsupported actions. When they do, we surface the provider's raw error, and a Workday auth error written in Workday-specific concepts doesn't help much in our dashboard.

I'm Will Leeney, AI Engineer at StackOne. I completed a PhD studying best practices for evals and evaluation under randomness. My first month at StackOne was spent building an AI agent that translates these provider errors into clear resolution steps. Here's that story. The secret sauce? Starting with evals from day one.

Start with the Problem, Not the Solution

Before writing any code, I interviewed Bryce, one of our solutions engineers, to understand the manual process of resolving errors. Users encounter errors in three places: account linking, account status, and connection logs. Each error type has documented solutions scattered across our guides. The goal was simple: automatically search our docs and generate resolution steps, saving Bryce's time for building integrations.

I resisted the urge to build a complex RAG system with time-series analysis and previous resolution patterns. Instead, I sketched the simplest possible flow (see Figure 1). Ship first, improve later.

Figure 1: A rough sketch of what the agent does to generate the resolution steps.

Build Tools, Test Early

The core of the system is Claude with custom tools that search our documentation. Here's what the tool registration looks like:

{
  "name": "grab_error_code_guide",
  "description": "Fetch unified API error codes from StackOne",
  "input_schema": {
    "type": "object",
    "properties": {},
    "required": []
  }
}
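
For orientation, here's a minimal sketch of how a tool registered this way might be wired up with the Anthropic Python SDK. The model name and example question are illustrative, and the tool body is stubbed (its one-line implementation appears below):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grab_error_code_guide() -> str:
    # Stub; the real body is a plain-text fetch of our docs (see below).
    ...

tools = [{
    "name": "grab_error_code_guide",
    "description": "Fetch unified API error codes from StackOne",
    "input_schema": {"type": "object", "properties": {}, "required": []},
}]

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Why did this Workday call return 401?"}],
)

# If Claude asks for the tool, run it and feed the result back in a follow-up turn.
for block in response.content:
    if block.type == "tool_use" and block.name == "grab_error_code_guide":
        guide = grab_error_code_guide()
        # ...append a tool_result content block and call
        # client.messages.create again with the extended message list.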

One thing that massively sped up development: all our docs are exposed as plain .txt and .md files at docs.stackone.com/llms.txt. No scraping, no complex parsing—just grab the text and go. This turned what could've been days of data pipeline work into a simple fetch request.
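
In code, that means each doc-fetching tool body can be a couple of lines. A sketch, using httpx (my choice of HTTP client) against the llms.txt index; the per-guide URLs aren't spelled out in this post:

import httpx

def fetch_docs_index() -> str:
    # Everything is plain text, so "the data pipeline" is one GET request.
    return httpx.get("https://docs.stackone.com/llms.txt", timeout=10).text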

I built five tools:

  1. grab_error_code_guide: Fetches our error code reference
  2. grab_troubleshooting_guide: Fetches our troubleshooting guide
  3. search_stackone_docs: Searches all of our docs
  4. search_provider_guides: Searches our provider-specific documentation
  5. search_docs: Vector search across all documentation using Turbopuffer, a vector database (sketched below)
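
To give a flavour of the vector-search tool, here's a rough sketch pairing an OpenAI embedding with a Turbopuffer lookup. The namespace name, attribute names, and embedding model are my assumptions, and the Namespace/query interface shown is the v0-style turbopuffer Python client, which may differ from the SDK version you have:

import turbopuffer as tpuf
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_docs(query: str) -> list[str]:
    # Embed the query, then nearest-neighbour search over doc chunks that
    # were previously upserted into a Turbopuffer namespace.
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=query,
    ).data[0].embedding
    ns = tpuf.Namespace("stackone-docs")  # namespace name is illustrative
    results = ns.query(
        vector=embedding,
        top_k=5,
        distance_metric="cosine_distance",
        include_attributes=["text"],
    )
    return [row.attributes["text"] for row in results]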

The trickiest part was extracting the right context from messy logs. I used Pydantic for structured outputs with gpt-4o-mini to pull the exact fields needed for resolution.
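
Here's roughly what that extraction step looks like, assuming the OpenAI Python SDK's structured-output parse helper; the field names on the model are illustrative:

from openai import OpenAI
from pydantic import BaseModel

class ErrorContext(BaseModel):
    # Field names here are illustrative; the real schema pulls whatever
    # the resolution agent needs from the raw log.
    provider: str
    error_code: str
    failing_action: str
    summary: str

client = OpenAI()

def extract_context(raw_log: str) -> ErrorContext:
    # gpt-4o-mini parses the messy log into a typed Pydantic object.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the error context from this provider log."},
            {"role": "user", "content": raw_log},
        ],
        response_format=ErrorContext,
    )
    return completion.choices[0].message.parsed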

Evals Drive Everything

Here's where most AI features fail: they ship without knowing if they actually work. We took a different approach.

First, I connected the Lambda function to LangSmith for full observability. Then I created real errors in our dev environment and ran them through the resolution agent. In LangSmith, I manually corrected the outputs to create a golden dataset. This became our evaluation benchmark.
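
Building the golden dataset programmatically looks something like this with the LangSmith client; the dataset name and example contents are illustrative:

from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

dataset = client.create_dataset(
    "error-resolution-golden",
    description="Manually corrected resolution steps for real dev-env errors",
)
client.create_example(
    inputs={"error_log": "401 invalid_token from Workday: ..."},
    outputs={"resolution": "1. Regenerate the integration credentials. 2. ..."},
    dataset_id=dataset.id,
)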

Now every prompt change could be measured. Did the new prompt improve resolution quality? The evals told us immediately. No more guessing whether changes helped or hurt.
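
With the golden dataset in place, each prompt change can be scored as a LangSmith experiment. A sketch, with a deliberately toy evaluator standing in for the real quality checks:

from langsmith import evaluate

def new_prompt_agent(error_log: str) -> str:
    # Stand-in for the resolution agent running with the candidate prompt.
    return "1. Check the API key scopes. 2. ..."

def has_numbered_steps(run, example) -> dict:
    # Toy evaluator for illustration; in practice you'd score against the
    # golden resolution, e.g. with an LLM judge.
    return {"key": "has_steps", "score": float("1." in run.outputs["resolution"])}

results = evaluate(
    lambda inputs: {"resolution": new_prompt_agent(inputs["error_log"])},
    data="error-resolution-golden",
    evaluators=[has_numbered_steps],
    experiment_prefix="prompt-v2",
)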

Ship with Feedback Loops

Once the feature was in production, we realised customers might want to give feedback on the resolution steps. By implementing a feedback system, we could translate those real-world examples into a better eval dataset (see the sketch after this list):

  1. Each resolution gets a unique trace ID from LangSmith
  2. Users can rate the resolution quality
  3. Feedback flows back to LangSmith, expanding our eval dataset
  4. Real-world usage continuously improves the model
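
The feedback hook itself is small. A sketch using the LangSmith client; the feedback key is my naming:

from langsmith import Client

client = Client()

def record_resolution_feedback(trace_id: str, helpful: bool, comment: str = "") -> None:
    # Attach the user's rating to the LangSmith trace; rated runs can then be
    # promoted into the eval dataset.
    client.create_feedback(
        run_id=trace_id,
        key="resolution_helpful",
        score=1.0 if helpful else 0.0,
        comment=comment,
    )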

This created a virtuous cycle: ship → collect feedback → improve evals → enhance prompt → ship better version.

The Technical Stack

  • Lambda: Hosts the resolution agent
  • LangSmith: Tracks every generation and builds eval datasets
  • Logfire: Monitors LLM performance
  • DataDog: Tracks execution metrics
  • PostHog: Measures feature usage
  • CDK: Manages infrastructure as code

The frontend integration was straightforward thanks to our design and engineering team. We exposed a single endpoint /ai/resolutions that accepts error logs and returns resolution steps.
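
A call to that endpoint might look like this; the host, auth header, and payload shape are hypothetical, since only the path comes from the post:

import httpx

resp = httpx.post(
    "https://api.stackone.com/ai/resolutions",
    json={"error_log": "401 invalid_token from Workday: ..."},
    headers={"Authorization": "Bearer <api-key>"},
    timeout=30,
)
print(resp.json()["resolution_steps"])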

Lessons Learned

  1. Start with evals, not features. Without evaluation, you're flying blind.
  2. Ship the simplest version first. My initial RAG design would have taken months. The basic version shipped in weeks.
  3. Real usage beats synthetic data. Customer feedback creates better evals than any test data.
  4. Tools are just functions. Don't overthink LLM tool design: they're Python functions with JSON schemas.

The entire feature went from concept to production in one month. Not because we rushed, but because we focused on what mattered: knowing whether it actually worked. Evals aren't an afterthought in AI development; they're the foundation that makes shipping with confidence possible.
