Non-Functional Requirements for AI Systems: What Staff Engineers Should Specify

TL;DR

Teams spec what an AI feature should do and forget how well it must do it. Write down the accuracy threshold, latency budget, cost ceiling, fallback behavior, observability, and governance requirements up front. For a probabilistic system, the non-functional requirements are what define "working."

Every team can tell you what their AI feature is supposed to do. Far fewer can tell you how accurate it has to be, how slow it’s allowed to get, what it costs per request, or what happens when it fails. Those are the non-functional requirements (NFRs), and for AI systems they’re not a footnote. They’re the difference between a demo and a product.

A functional spec for a traditional feature is mostly complete on its own, because the behavior is deterministic. An AI feature is probabilistic, which means the NFRs aren’t optional context — they define what “working” even means. Here’s the checklist I push every team to fill in before building.

Accuracy: name the threshold and the cost of being wrong

“It should be accurate” is not a requirement. Specify:

The bar. What success rate makes this worth shipping? 90%? 99%? The answer depends entirely on the next point.
The cost of a miss. A wrong autocomplete suggestion costs nothing. A wrong update to a financial record costs a lot. The acceptable error rate is a function of blast radius, and it should be written down.
How you’ll measure it. If there’s no eval methodology in the spec, the accuracy target is decoration. (You don’t need a labeled dataset to start — but you do need a plan.)
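One way to make “how you’ll measure it” concrete is to state the bar as a check on an eval set rather than a raw percentage, because 95 correct out of 100 says less than it seems. The sketch below uses the Wilson score interval, which is my choice of technique and not something the spec prescribes. The function names and the thresholds in the usage lines are illustrative placeholders.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success rate (z=1.96 is ~95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    center = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (center - margin) / (1 + z2 / trials)


def meets_accuracy_bar(successes: int, trials: int, threshold: float) -> bool:
    """Ship only if we are reasonably confident the true success rate clears the bar."""
    return wilson_lower_bound(successes, trials) >= threshold


# Usage: placeholder thresholds; derive real ones from the cost of a miss.
print(round(wilson_lower_bound(95, 100), 3))  # 0.888
print(meets_accuracy_bar(95, 100, 0.90))      # False
print(meets_accuracy_bar(990, 1000, 0.97))    # True
```

With these numbers, 95/100 looks like 95% but its lower bound is about 0.89, so it does not clear a 0.90 bar, while 990/1000 clears 0.97. The bar itself still comes from the cost of a miss; the interval only tells you how many eval cases you need before you can claim it.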

Latency: set a budget, including the tail

Specify a latency target, and specify it at the tail, not the median or the average. A p50 of 800ms means nothing if your p99 is 15 seconds: that long tail is what users actually remember.
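A latency budget is easiest to enforce when it names each percentile together with its limit. Here is a minimal sketch using nearest-rank percentiles over latency samples. The `LatencyBudget` class and the 1,000 ms / 3,000 ms limits are hypothetical, chosen only to reproduce the p50/p99 example above.

```python
import math
from dataclasses import dataclass


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class LatencyBudget:
    p50_ms: float
    p99_ms: float

    def violations(self, samples_ms: list[float]) -> list[str]:
        """Return a message for every percentile that exceeds its limit."""
        out = []
        for name, pct, limit in (("p50", 50, self.p50_ms), ("p99", 99, self.p99_ms)):
            observed = percentile(samples_ms, pct)
            if observed > limit:
                out.append(f"{name} {observed:.0f}ms exceeds budget {limit:.0f}ms")
        return out


# Usage: 98 fast requests and 2 very slow ones, mirroring the example above.
samples = [800] * 98 + [15_000] * 2
budget = LatencyBudget(p50_ms=1_000, p99_ms=3_000)
print(budget.violations(samples))  # ['p99 15000ms exceeds budget 3000ms']
```

A budget that only checked the median would pass this traffic, which is exactly the failure the tail requirement is meant to catch.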

And be explicit about perceived latency. Streaming a response token-by-token changes the experience even when total time is unchanged. If the spec says “feels fast,” it should say how: a streaming response, an optimistic UI, or work moved off the critical path.
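If the spec promises a streaming response, it can state two numbers instead of one: time to first token (what the user perceives as responsiveness) and total time. Here is a minimal sketch that measures both, assuming the provider exposes the response as an iterator of text chunks. `fake_model_stream` is a hypothetical stand-in for a real streaming client.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> tuple[str, float, float]:
    """Consume a streamed response; return (text, time_to_first_token_s, total_s)."""
    start = time.perf_counter()
    first_token_s = None
    parts = []
    for chunk in chunks:
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        parts.append(chunk)
    total_s = time.perf_counter() - start
    # An empty stream never produced a first token; report the total instead.
    ttft_s = first_token_s if first_token_s is not None else total_s
    return "".join(parts), ttft_s, total_s


def fake_model_stream(n_chunks: int, delay_s: float) -> Iterator[str]:
    # Hypothetical stand-in for a provider's streaming API: one chunk every delay_s.
    for i in range(n_chunks):
        time.sleep(delay_s)
        yield f"chunk{i} "


# Usage: time to first token is about one delay; total is about n_chunks delays.
text, ttft_s, total_s = measure_stream(fake_model_stream(5, 0.02))
print(f"first token after {ttft_s:.2f}s, full response after {total_s:.2f}s")
```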

Cost: a per-request ceiling, not a monthly surprise

AI features have a unit cost that scales with usage and, for agents, with the difficulty of the task, since the hard problems take more turns. Specify a target cost per request and a model for how it scales. “We’ll watch the bill” is how you discover at month-end that the feature is unprofitable per use. Decide the ceiling up front; it constrains model choice and architecture.
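Once you know token counts per turn, you can check a per-request ceiling with simple arithmetic. This sketch assumes per-million-token pricing for input and output, a common billing scheme but one you should confirm with your provider. The `Pricing` values and the per-turn token growth in the usage lines are placeholders, not real rates or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Placeholder USD prices per million tokens; substitute your provider's real rates.
    input_per_mtok: float
    output_per_mtok: float


def request_cost(turns: list[tuple[int, int]], pricing: Pricing) -> float:
    """Cost of one user request; each turn is (input_tokens, output_tokens)."""
    return sum(
        inp / 1_000_000 * pricing.input_per_mtok + out / 1_000_000 * pricing.output_per_mtok
        for inp, out in turns
    )


def within_ceiling(turns: list[tuple[int, int]], pricing: Pricing, ceiling_usd: float) -> bool:
    return request_cost(turns, pricing) <= ceiling_usd


# Usage with placeholder prices and a $0.05 per-request ceiling.
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
single_call = [(2_000, 500)]
# An agent that resends its growing history each turn (assumed +2,500 input tokens per turn).
agent_run = [(2_000 + 2_500 * k, 500) for k in range(5)]
print(within_ceiling(single_call, pricing, 0.05))  # True  (about $0.0135)
print(within_ceiling(agent_run, pricing, 0.05))    # False (about $0.1425)
```

The agent run costs roughly ten times the single call even though every turn produces the same output, because the resent history makes input tokens grow each turn. That growth curve is the “model for how it scales” the spec should write down.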

Fallback: define the behavior when the model fails

The model will fail — time out, return garbage, or hit a rate limit. The spec must answer: what does the user see then? Acceptable answers include a graceful “try again,” a degraded non-AI path, or a cached result. An unacceptable answer is “we didn’t think about it,” which in practice means a spinner that never resolves.

This is the NFR teams skip most often and regret most reliably.
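Putting the fallback chain in one wrapper gives every call site the same behavior. Here is a minimal sketch, assuming a synchronous `call_model` function and an in-memory cache (both hypothetical). It bounds the wait with a timeout and treats exceptions (such as rate-limit errors) and invalid output as failures. On failure it falls back first to a cached result, then to a graceful “try again” message.

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Callable

# Module-level pool: a per-call `with ThreadPoolExecutor()` block would wait on exit
# for the slow call to finish, which would defeat the timeout.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

DEGRADED_MESSAGE = "We couldn't generate an answer right now. Please try again."


@dataclass
class Answer:
    text: str
    source: str  # "model", "cache", or "degraded"


def answer_with_fallback(
    prompt: str,
    call_model: Callable[[str], str],
    cache: dict[str, str],
    timeout_s: float = 5.0,
    is_valid: Callable[[str], bool] = lambda text: bool(text.strip()),
) -> Answer:
    future = _POOL.submit(call_model, prompt)
    try:
        text = future.result(timeout=timeout_s)
        if is_valid(text):
            return Answer(text, "model")
        # Otherwise the model returned garbage; fall through to the fallbacks.
    except concurrent.futures.TimeoutError:
        pass  # stop waiting; the worker thread itself is not cancelled
    except Exception:
        pass  # rate limits, network errors, provider outages
    if prompt in cache:
        return Answer(cache[prompt], "cache")
    return Answer(DEGRADED_MESSAGE, "degraded")


# Usage with a stand-in model that always fails (e.g. a rate limit).
def rate_limited(prompt: str) -> str:
    raise RuntimeError("429 Too Many Requests")


print(answer_with_fallback("summarize ticket 42", rate_limited, cache={}).source)  # degraded
```

One caveat of this design: a timed-out call keeps running in its worker thread, because the wrapper only stops waiting for it. A real client should also pass its own request timeout so abandoned calls don't pile up.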

Observability: decide what you log before you ship

For a probabilistic system, observability isn’t debugging infrastructure; it’s how you measure quality at all. Specify that every request logs its input, output, model and prompt version, latency, cost, and any downstream user signal. Without this, you can’t tell whether last week’s change helped or hurt. If you retrofit it after launch, the data you most needed, from the first weeks of real traffic, was never captured.
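In practice, that means one structured event per request, written at call time, plus a later event for the user signal, joined on a request ID. This sketch assumes JSON lines and uses a plain list as a stand-in for your log pipeline. The field names are illustrative, not a standard schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class InferenceLog:
    input_text: str
    output_text: str
    model_version: str
    prompt_version: str
    latency_ms: float
    cost_usd: float
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)


def log_inference(event: InferenceLog, sink: list[str]) -> None:
    # One JSON line per request, written at call time.
    sink.append(json.dumps({"type": "inference", **asdict(event)}))


def log_user_signal(request_id: str, signal: str, sink: list[str]) -> None:
    # User signals (accepted, edited, thumbs down...) arrive later; join on request_id.
    sink.append(json.dumps({"type": "user_signal", "request_id": request_id, "signal": signal}))


# Usage
sink: list[str] = []
event = InferenceLog(
    input_text="Summarize ticket 42",
    output_text="Customer reports login failures.",
    model_version="model-v1",
    prompt_version="summarize-v3",
    latency_ms=812.0,
    cost_usd=0.0135,
)
log_inference(event, sink)
log_user_signal(event.request_id, "edited", sink)
```

Logging the signal as its own event avoids rewriting the original record and still lets you compare quality across prompt or model versions with a simple join.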

Governance: the requirements that come from outside engineering

On enterprise and regulated systems, some NFRs aren’t yours to choose. Specify them explicitly so they’re designed in, not discovered in an audit:

Data handling — what can and can’t be sent to a model provider? Where does inference run?
Auditability — can you reconstruct why the system did what it did, months later?
Attribution and reversibility — when AI changes a record, is it traceable and undoable?
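Attribution and reversibility come naturally if every AI-initiated change goes through an append-only change log. Each entry records who made the change, with which model version, and the prior value. Here is a minimal in-memory sketch. `AuditedStore` and its fields are hypothetical, and a production version would write the log to durable storage. The same log also supports auditability, because months later you can reconstruct what changed and the reason recorded at the time.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ChangeRecord:
    change_id: int
    record_id: str
    field_name: str
    old_value: Any        # None if the field did not exist before
    new_value: Any
    actor: str            # who made the change, e.g. "ai:invoice-agent" or "user:alice"
    model_version: str    # "n/a" for human-initiated changes
    reason: str           # rationale captured at change time, for later audits


class AuditedStore:
    """Hypothetical record store where every write is logged and can be undone."""

    def __init__(self) -> None:
        self.data: dict[str, dict[str, Any]] = {}
        self.log: list[ChangeRecord] = []  # append-only: entries are never edited or deleted

    def apply(self, record_id: str, field_name: str, new_value: Any,
              actor: str, model_version: str, reason: str) -> ChangeRecord:
        row = self.data.setdefault(record_id, {})
        change = ChangeRecord(
            change_id=len(self.log),
            record_id=record_id,
            field_name=field_name,
            old_value=row.get(field_name),
            new_value=new_value,
            actor=actor,
            model_version=model_version,
            reason=reason,
        )
        row[field_name] = new_value
        self.log.append(change)
        return change

    def revert(self, change_id: int, actor: str) -> ChangeRecord:
        # Undo by writing the old value back as a new logged change,
        # so the history shows both the AI edit and its reversal.
        original = self.log[change_id]
        return self.apply(original.record_id, original.field_name, original.old_value,
                          actor=actor, model_version="n/a",
                          reason=f"revert of change {change_id}")


# Usage: an AI agent edits an invoice amount, then a human reverts it.
store = AuditedStore()
store.data["inv-1"] = {"amount": 100}
edit = store.apply("inv-1", "amount", 110, actor="ai:invoice-agent",
                   model_version="model-v1", reason="matched purchase order total")
store.revert(edit.change_id, actor="user:alice")
print(store.data["inv-1"]["amount"])  # 100
```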

The takeaway

The functional spec for an AI feature is the easy half and the half everyone writes. The non-functional requirements — accuracy threshold, latency budget, cost ceiling, fallback behavior, observability, governance — are what determine whether it survives contact with production. Specifying them is exactly the kind of unglamorous, high-leverage work that’s a staff engineer’s job.

If your AI project doesn’t have these written down, that’s not a small gap. That is the project’s biggest risk, unmanaged.