Published: August 20, 2025 · 4 min read
Written by Will Leeney, AI Engineer

Building an Eval-Driven AI Feature at StackOne

StackOne connects you to hundreds of providers through one API. But things go wrong: expired keys, missing scopes, unsupported actions. When they do, we surface the provider's raw error, and a Workday auth error written in Workday-specific concepts doesn't help much in our dashboard.

I'm Will Leeney, AI Engineer at StackOne. I completed a PhD studying best practices for evals and evaluation under randomness. My first month at StackOne was spent building an AI agent that translates these provider errors into clear resolution steps. Here's that story. The secret sauce? Starting with evals from day one.

Start with the Problem, Not the Solution

Before writing any code, I interviewed Bryce, one of our solutions engineers, to understand the manual process of resolving errors. Users encounter errors in three places: account linking, account status, and connection logs. Each error type has documented solutions scattered across our guides. The goal was simple: automatically search our docs and generate resolution steps, saving Bryce's time for building integrations.

I resisted the urge to build a complex RAG system with time-series analysis and previous resolution patterns. Instead, I sketched the simplest possible flow (see Figure 1). Ship first, improve later.

Figure 1: A rough sketch of what the agent does to generate the resolution steps.

Build Tools, Test Early

The core of the system is Claude with custom tools that search our documentation. Here's what the tool registration looks like:

{
  "name": "grab_error_code_guide",
  "description": "Fetch unified API error codes from StackOne",
  "input_schema": {
    "type": "object",
    "properties": {},
    "required": []
  }
}
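
For orientation, here's a minimal sketch of how a tool registered this way might be wired up with the Anthropic Python SDK. The model name and example question are illustrative, and the tool body is stubbed (its one-line implementation appears below):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grab_error_code_guide() -> str:
    # Stub; the real body is a plain-text fetch of our docs (see below).
    ...

tools = [{
    "name": "grab_error_code_guide",
    "description": "Fetch unified API error codes from StackOne",
    "input_schema": {"type": "object", "properties": {}, "required": []},
}]

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Why did this Workday call return 401?"}],
)

# If Claude asks for the tool, run it and feed the result back in a follow-up turn.
for block in response.content:
    if block.type == "tool_use" and block.name == "grab_error_code_guide":
        guide = grab_error_code_guide()
        # ...append a tool_result content block and call
        # client.messages.create again with the extended message list.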

One thing that massively sped up development: all our docs are exposed as plain .txt and .md files at docs.stackone.com/llms.txt. No scraping, no complex parsing—just grab the text and go. This turned what could've been days of data pipeline work into a simple fetch request.
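
In code, that means each doc-fetching tool body can be a couple of lines. A sketch, using httpx (my choice of HTTP client) against the llms.txt index; the per-guide URLs aren't spelled out in this post:

import httpx

def fetch_docs_index() -> str:
    # Everything is plain text, so "the data pipeline" is one GET request.
    return httpx.get("https://docs.stackone.com/llms.txt", timeout=10).text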

I built five tools:

  1. grab_error_code_guide: Fetches our error code reference
  2. grab_troubleshooting_guide: Fetches our troubleshooting guide
  3. search_stackone_docs: Searches all of our docs
  4. search_provider_guides: Searches our provider-specific documentation
  5. search_docs: Vector search across all documentation using Turbopuffer, a vector database (sketched below)
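
To give a flavour of the vector-search tool, here's a rough sketch pairing an OpenAI embedding with a Turbopuffer lookup. The namespace name, attribute names, and embedding model are my assumptions, and the Namespace/query interface shown is the v0-style turbopuffer Python client, which may differ from the SDK version you have:

import turbopuffer as tpuf
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_docs(query: str) -> list[str]:
    # Embed the query, then nearest-neighbour search over doc chunks that
    # were previously upserted into a Turbopuffer namespace.
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=query,
    ).data[0].embedding
    ns = tpuf.Namespace("stackone-docs")  # namespace name is illustrative
    results = ns.query(
        vector=embedding,
        top_k=5,
        distance_metric="cosine_distance",
        include_attributes=["text"],
    )
    return [row.attributes["text"] for row in results]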

The trickiest part was extracting the right context from messy logs. I used Pydantic for structured outputs with gpt-4o-mini to pull the exact fields needed for resolution.
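
Here's roughly what that extraction step looks like, assuming the OpenAI Python SDK's structured-output parse helper; the field names on the model are illustrative:

from openai import OpenAI
from pydantic import BaseModel

class ErrorContext(BaseModel):
    # Field names here are illustrative; the real schema pulls whatever
    # the resolution agent needs from the raw log.
    provider: str
    error_code: str
    failing_action: str
    summary: str

client = OpenAI()

def extract_context(raw_log: str) -> ErrorContext:
    # gpt-4o-mini parses the messy log into a typed Pydantic object.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the error context from this provider log."},
            {"role": "user", "content": raw_log},
        ],
        response_format=ErrorContext,
    )
    return completion.choices[0].message.parsed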

Evals Drive Everything

Here's where most AI features fail: they ship without knowing if they actually work. We took a different approach.

First, I connected the Lambda function to LangSmith for full observability. Then I created real errors in our dev environment and ran them through the resolution agent. In LangSmith, I manually corrected the outputs to create a golden dataset. This became our evaluation benchmark.
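
Building the golden dataset programmatically looks something like this with the LangSmith client; the dataset name and example contents are illustrative:

from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

dataset = client.create_dataset(
    "error-resolution-golden",
    description="Manually corrected resolution steps for real dev-env errors",
)
client.create_example(
    inputs={"error_log": "401 invalid_token from Workday: ..."},
    outputs={"resolution": "1. Regenerate the integration credentials. 2. ..."},
    dataset_id=dataset.id,
)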

Now every prompt change could be measured. Did the new prompt improve resolution quality? The evals told us immediately. No more guessing whether changes helped or hurt.
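
With the golden dataset in place, each prompt change can be scored as a LangSmith experiment. A sketch, with a deliberately toy evaluator standing in for the real quality checks:

from langsmith import evaluate

def new_prompt_agent(error_log: str) -> str:
    # Stand-in for the resolution agent running with the candidate prompt.
    return "1. Check the API key scopes. 2. ..."

def has_numbered_steps(run, example) -> dict:
    # Toy evaluator for illustration; in practice you'd score against the
    # golden resolution, e.g. with an LLM judge.
    return {"key": "has_steps", "score": float("1." in run.outputs["resolution"])}

results = evaluate(
    lambda inputs: {"resolution": new_prompt_agent(inputs["error_log"])},
    data="error-resolution-golden",
    evaluators=[has_numbered_steps],
    experiment_prefix="prompt-v2",
)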

Ship with Feedback Loops

Once the feature was in production, we realised customers might want to give feedback on the resolution steps. By implementing a feedback system, we could translate those real-world examples into a better eval dataset (see the sketch after this list):

  1. Each resolution gets a unique trace ID from LangSmith
  2. Users can rate the resolution quality
  3. Feedback flows back to LangSmith, expanding our eval dataset
  4. Real-world usage continuously improves the model
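
The feedback hook itself is small. A sketch using the LangSmith client; the feedback key is my naming:

from langsmith import Client

client = Client()

def record_resolution_feedback(trace_id: str, helpful: bool, comment: str = "") -> None:
    # Attach the user's rating to the LangSmith trace; rated runs can then be
    # promoted into the eval dataset.
    client.create_feedback(
        run_id=trace_id,
        key="resolution_helpful",
        score=1.0 if helpful else 0.0,
        comment=comment,
    )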

This created a virtuous cycle: ship → collect feedback → improve evals → enhance prompt → ship better version.

The Technical Stack

  • Lambda: Hosts the resolution agent
  • LangSmith: Tracks every generation and builds eval datasets
  • Logfire: Monitors LLM performance
  • DataDog: Tracks execution metrics
  • PostHog: Measures feature usage
  • CDK: Manages infrastructure as code

The frontend integration was straightforward thanks to our design and engineering team. We exposed a single endpoint /ai/resolutions that accepts error logs and returns resolution steps.
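
A call to that endpoint might look like this; the host, auth header, and payload shape are hypothetical, since only the path comes from the post:

import httpx

resp = httpx.post(
    "https://api.stackone.com/ai/resolutions",
    json={"error_log": "401 invalid_token from Workday: ..."},
    headers={"Authorization": "Bearer <api-key>"},
    timeout=30,
)
print(resp.json()["resolution_steps"])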

Lessons Learned

  1. Start with evals, not features. Without evaluation, you're flying blind.
  2. Ship the simplest version first. My initial RAG design would have taken months. The basic version shipped in weeks.
  3. Real usage beats synthetic data. Customer feedback creates better evals than any test data.
  4. Tools are just functions. Don't overthink LLM tool design: they're Python functions with JSON schemas.

The entire feature went from concept to production in one month. Not because we rushed, but because we focused on what mattered: knowing whether it actually worked. Evals aren't an afterthought in AI development; they're the foundation that makes shipping with confidence possible.
