Detection Isn't the Hard Part

Most prompt injection research optimises for one number: recall. Can the model catch the attack?

In production that turns out to be the easier problem. The harder one is calibration on legitimate data. A classifier that flags 7% of benign tool results means a blocked Workday employee record every 14 API calls, or a refused CRM lookup every dozen tasks. That isn’t a security product, it’s a bug.

This post is about how we got Defender’s false positive rate on real connector data from roughly 7% down to about 0.2%, a ~97% reduction, with two fixes at different layers of the system. The 22MB MiniLM-L6 model at the centre of it is the subject of a separate post; here we’re focused on everything around the model that decides whether it’s actually shippable.

The problem nobody shows you in benchmarks

Academic prompt injection benchmarks are mostly attack-only or attack plus generic-benign. Wikipedia paragraphs, conversational data, jailbreak corpora. None of them include the kind of payload an agent actually consumes in production: an HRIS employee record, an ATS candidate note, a CRM activity log, an email body.

When we ran the v1 classifier on real connector payloads, the false positive rate was alarming. Roughly 7% of benign payloads were flagged as suspicious. Some examples of what trips a naive classifier:

"restrictAbsenceEdit", a Workday config field that reads like an imperative command
"Bulk updated description", a CRM activity log that looks like an instruction
"Delete Test Field", a literal admin label that’s lexically identical to an attack

The model wasn’t broken. It had learned that imperative-sounding phrases are suspicious, which they often are, and applied that signal uniformly. The training distribution was the issue: it contained no examples teaching the model that “Delete Test Field” is what real systems write into their own metadata.

An enterprise payload and an injection attempt can be lexically identical to a model that has only seen Wikipedia for negatives. Closing that gap took two fixes at two different layers.

Fix #1: training-time hard negatives

A hard negative is a benign training example that looks like a positive. It sits close to the decision boundary, exactly where a naive model is most likely to be wrong. Random Wikipedia text is an easy negative because nothing about it looks like an injection. "restrictAbsenceEdit" is a hard negative because every surface signal except the underlying intent points at suspicious.

The fix here wasn’t a one-off connector patch. It was a recurring pattern. Every time we found a class of false positives in production, we built a distribution-matched hard-negative dataset for that class and retrained the model. The shipped weights are calibrated against four different shapes of benign data:

Connector-shaped data drawn from real benign tool-result payloads across HRIS, ATS, and CRM connectors
Email content, which has different rhythms than connector field values: longer-form, multi-sentence, sometimes templated (2FA notifications, calendar invites, automated alerts)
Multilingual benign content, to prevent language-based false positives
Dev tooling output like build logs, error tracebacks, and commit messages, which contain a surprising amount of imperative vocabulary (“delete”, “kill”, “ignore”) in completely benign contexts

The model now has a learned sense of what “looks imperative but isn’t” looks like in each of these shapes.
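The mining step behind that loop can be sketched roughly as follows. This is a minimal illustration, not Defender's training pipeline: the `Scorer` type, `mineHardNegatives`, the 0.5 threshold, and the toy `naiveScore` function are all hypothetical names and values introduced here. The idea is simply that any verified-benign string the current model flags is a false positive, and so a candidate hard negative for the next retraining round.

```typescript
// Hypothetical types and names: none of these come from Defender's codebase.
type Scorer = (text: string) => number; // injection probability in [0, 1]

interface HardNegative {
  text: string;
  score: number;
  label: "benign";
}

// Collect benign strings that the current model scores at or above the
// flagging threshold. Every input is assumed to be verified benign, so each
// hit is a false positive and therefore a hard-negative candidate.
function mineHardNegatives(
  benignTexts: string[],
  score: Scorer,
  threshold = 0.5, // assumed threshold, for illustration only
): HardNegative[] {
  const seen = new Set<string>();
  const mined: HardNegative[] = [];
  for (const text of benignTexts) {
    if (seen.has(text)) continue; // skip exact duplicates
    seen.add(text);
    const s = score(text);
    if (s >= threshold) mined.push({ text, score: s, label: "benign" });
  }
  // Highest-scoring (hardest) examples first.
  return mined.sort((a, b) => b.score - a.score);
}

// Toy stand-in for a naive classifier that treats imperative vocabulary as suspicious.
const naiveScore: Scorer = (text) =>
  /delete|ignore|update|restrict/i.test(text) ? 0.9 : 0.1;

const mined = mineHardNegatives(
  ["Delete Test Field", "restrictAbsenceEdit", "Bulk updated description", "Jane Doe"],
  naiveScore,
);
```

With the toy scorer, the three imperative-looking connector strings are mined and "Jane Doe" is not. In practice the scorer would be the deployed model and the inputs would be benign payloads sampled from the distribution that produced the false positives.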

The impact on real connector data, measured on a 940-payload benchmark across our HRIS, ATS, and CRM connectors:

Defender version	False positive rate
v0.5.8 (baseline)	~7%
v0.6.3 (with distribution-matched hard negatives)	~1.7%

A 4× reduction from training-data choices alone, before any inference-time work.

The important framing here: hard negatives aren’t a config knob. The user gets the calibrated weights when they install the package. The leverage is at the data layer, not at the API.

Fix #2: inference-time structural filtering

The second fix is in front of the model, not inside it.

Defender ships with a small (~0.7MB) FastText classifier called the Semantic Field Extractor, or SFE. It runs before the injection model, which is Tier 2 of the pipeline. For each string field in the JSON payload, SFE decides whether to send it to Tier 2 (pass) or skip it (drop). Drops cover UUIDs, ISO timestamps, opaque tokens, version strings, and ID-shaped strings: things that aren’t realistic injection vectors but score noisily on a content classifier because they look unusual.
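To make the pass/drop decision concrete, here is a rough sketch of a structural field filter. It deliberately swaps the FastText classifier for a handful of regular expressions, so it is a stand-in for the idea rather than SFE itself. `DROP_SHAPES`, `collectStringFields`, `decide`, and `filterFields` are hypothetical names, and the patterns are assumptions about what "UUID-shaped" or "token-shaped" might mean.

```typescript
// Regex stand-in for the shape decision. The real SFE is a FastText
// classifier; these patterns are illustrative assumptions only.
type Decision = "pass" | "drop";

interface Field {
  path: string;
  value: string;
}

const DROP_SHAPES: RegExp[] = [
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i, // UUID
  /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$/, // ISO date or timestamp
  /^v?\d+(\.\d+){1,3}$/, // version string such as v2.1.0
  /^\d+$/, // integer ID
  /^(?=.*\d)[A-Za-z0-9_-]{24,}$/, // long opaque token: no whitespace, at least one digit
];

// Walk any JSON-like value and collect every string leaf with its path.
function collectStringFields(node: unknown, path = "$", out: Field[] = []): Field[] {
  if (typeof node === "string") {
    out.push({ path, value: node });
  } else if (Array.isArray(node)) {
    node.forEach((item, i) => collectStringFields(item, `${path}[${i}]`, out));
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      collectStringFields(value, `${path}.${key}`, out);
    }
  }
  return out;
}

// Decide by shape alone, never by which connector produced the field.
function decide(value: string): Decision {
  const v = value.trim();
  return DROP_SHAPES.some((re) => re.test(v)) ? "drop" : "pass";
}

function filterFields(payload: unknown): { pass: Field[]; drop: Field[] } {
  const pass: Field[] = [];
  const drop: Field[] = [];
  for (const field of collectStringFields(payload)) {
    (decide(field.value) === "pass" ? pass : drop).push(field);
  }
  return { pass, drop };
}

const example = filterFields({
  id: "3f2b8c1e-9d4a-4e6f-8a7b-1c2d3e4f5a6b",
  updatedAt: "2024-05-01T12:30:00Z",
  apiVersion: "v2.1.0",
  name: "Delete Test Field",
  notes: ["Bulk updated description"],
});
```

In this example only `name` and `notes[0]` would reach the injection model; the UUID, timestamp, and version string are dropped before scoring. A learned classifier can presumably handle messier shapes than fixed patterns can, which is one reason a regex list like this should be read as a teaching aid.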

SFE earns its keep two ways.

It reduces false positives further. On the same connector benchmark:

Configuration	False positive rate
v0.6.3, SFE off	~1.7%
v0.6.3, SFE on	~0.2%

Another 8× reduction on top of the model improvements. Combined with the hard-negatives fix, that’s the headline result: ~7% baseline down to ~0.2%.
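The arithmetic behind those multipliers is easy to reproduce. The snippet below uses only the approximate rates quoted in this post; the function and variable names are ours, for illustration.

```typescript
// Approximate false positive rates quoted in this post.
const baselineFpr = 0.07; // v0.5.8
const hardNegFpr = 0.017; // v0.6.3, SFE off
const sfeFpr = 0.002; // v0.6.3, SFE on

// How many times smaller the rate became.
function foldReduction(before: number, after: number): number {
  return before / after;
}

// Relative reduction as a percentage of the starting rate.
function percentReduction(before: number, after: number): number {
  return (1 - after / before) * 100;
}

// Average number of benign calls per false positive at a given rate.
function callsPerFalsePositive(fpr: number): number {
  return 1 / fpr;
}

const fromHardNegatives = foldReduction(baselineFpr, hardNegFpr); // about 4.1x
const fromSfe = foldReduction(hardNegFpr, sfeFpr); // about 8.5x
const overallPercent = percentReduction(baselineFpr, sfeFpr); // about 97.1%
const callsBefore = callsPerFalsePositive(baselineFpr); // about 14 calls
const callsAfter = callsPerFalsePositive(sfeFpr); // about 500 calls
```

Put differently, at the rounded rates above a benign payload goes from being blocked about once every 14 calls to about once every 500.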

[Figure: bar chart of false positive rate on real connector data across three Defender configurations: v0.5.8 baseline at ~7%, v0.6.3 with hard negatives at ~1.7%, v0.6.3 with SFE on at ~0.2%.]

It makes Defender faster, not slower. Counterintuitive at first. Adding a layer in front of the model improves end-to-end latency, because SFE drops roughly 45% of fields (measured on our 940-payload connector benchmark) before they hit Tier 2, so the heavy classifier has less to score.

Configuration	Mean latency per payload
v0.6.3, SFE off	~83 ms
v0.6.3, SFE on	~45 ms

Roughly a 2× reduction end-to-end on enterprise connector payloads (which run multi-KB JSON with many fields). On smaller payloads the absolute numbers are lower: the README’s ~10 ms badge is measured on sample-sized text. The largest wins come on metadata-heavy payloads, where SFE strips the long tail of UUIDs and timestamps before scoring.

There’s a deeper reason SFE works the way it does. It classifies fields by shape and content type: UUID-shaped, timestamp-shaped, integer ID, opaque token. Not by connector identity. UUIDs look like UUIDs whether they come from Workday, Greenhouse, or a customer’s bespoke internal API. That property matters for the next section.

What about connectors we haven’t seen?

Hard negatives raise the floor on schemas the model has been trained on, or close cousins of them. They can’t help on a payload shape the model has never seen, like a customer’s custom internal tool or a third-party API not in our connector library.

This is where the structural nature of SFE pays off. Dropping UUIDs and timestamps works regardless of where they come from. Hard negatives can’t generalise like that; structural patterns can.

To check whether the system holds up on data it wasn’t trained on, we ran it against two public out-of-distribution datasets:

Dataset	What it covers	False positive rate (v0.6.3, SFE on)
ToolACE	Rich enterprise API responses across finance, security, medical	~0.9%
ChatML	Open conversational dataset, 10K samples	~0.1%

Under 1% false positives on enterprise API data the model has never been fine-tuned on. SFE doesn’t change these numbers much, because the structural filtering does the same job regardless of what the model has seen. What’s notable is that the floor holds at all.
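A benign-only check like this reduces to counting how many known-benign payloads get flagged. The sketch below shows one way to compute that, with and without a structural pre-filter. It rests on assumptions the post does not state: that a payload counts as flagged when any scored field crosses the threshold, and that the threshold is 0.5. `falsePositiveRate`, `isFlagged`, `toyScore`, and `toyKeep` are hypothetical names, and the toy scorer and filter stand in for the real model and SFE.

```typescript
type FieldScorer = (text: string) => number; // injection probability in [0, 1]
type FieldFilter = (text: string) => boolean; // true means "score this field"

// Assumption: a payload is flagged if any scored field crosses the threshold.
function isFlagged(
  fields: string[],
  score: FieldScorer,
  threshold: number,
  keep: FieldFilter = () => true,
): boolean {
  return fields.filter(keep).some((field) => score(field) >= threshold);
}

// Every payload is known to be benign, so every flag is a false positive.
function falsePositiveRate(
  benignPayloads: string[][],
  score: FieldScorer,
  threshold = 0.5, // assumed threshold, for illustration only
  keep?: FieldFilter,
): number {
  if (benignPayloads.length === 0) return 0;
  const flagged = benignPayloads.filter((p) => isFlagged(p, score, threshold, keep)).length;
  return flagged / benignPayloads.length;
}

// Toy scorer that is noisy on digit-heavy strings, and a toy filter that
// drops pure-integer fields before scoring.
const toyScore: FieldScorer = (text) => (/\d{4}/.test(text) ? 0.8 : 0.1);
const toyKeep: FieldFilter = (text) => !/^\d+$/.test(text);

const benign: string[][] = [
  ["Jane Doe", "100234"],
  ["Quarterly review notes"],
  ["Order 20240501 shipped"],
  ["Ops", "77"],
];

const fprWithoutFilter = falsePositiveRate(benign, toyScore); // 2 of 4 payloads flagged
const fprWithFilter = falsePositiveRate(benign, toyScore, 0.5, toyKeep); // 1 of 4 payloads flagged
```

The toy numbers mean nothing in themselves. The point is the shape of the measurement: the same benign set, the same scorer, and the filter toggled on and off, which is how the SFE on/off comparisons in this post are framed.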

Why you need both

The two fixes solve different problems at different layers:

Hard negatives address content: what counts as benign in a given distribution
SFE addresses structure: what’s worth scoring in the first place

They fail differently. Hard negatives don’t generalise to schemas the model hasn’t seen. SFE can’t help when the noisy field is one the user actually writes into (name, description, body). Together they cover the cases each one alone would miss.

Current state

The numbers in this post are measured on Defender v0.6.3:

~0.2% FPR on real connector data (with SFE enabled)
~45 ms mean latency per payload (~2× faster than Defender without SFE, on enterprise connector payloads)
Under 1% FPR on out-of-distribution enterprise API data

SFE is opt-in in the npm package (useSfe: true); it’s enabled by default in StackOne’s hosted Defender deployment, which is where the 0.2% headline is measured. For free-form text classification (a single string rather than a JSON payload) SFE has nothing to filter and the model runs alone.

Defender has since moved to v0.7.0, which ships a multi-head classifier with temperature scaling. The calibration story in this post still applies (hard negatives and SFE are both still load-bearing), but the architecture has evolved past the single MiniLM-L6 head described here.

Closing

The headline of this post isn’t “we built a better model.” The architecture is the same MiniLM-L6 covered in the separate post. The headline is that we paid attention to the calibration problem, and the lever wasn’t always the model. Sometimes it was the training data. Sometimes it was a 0.7MB filter in front of the model that decides what’s worth scoring at all.

Detection wasn’t the hard part. Calibration was.

The next problem on our radar is custom connectors: payload shapes we’ll never see during training. SFE buys us a structural floor on those, but the long-tail content inside name/description/body fields is still where adversaries get to be creative. More on that in a future post.

Defender is open source: github.com/StackOneHQ/defender. Try it, file issues, send hard negatives.