The hard part of an LLM feature isn't the model — it's the engineering around it. Treat model calls as unreliable, measure quality with a small eval set, version your prompts, watch cost and latency, and design for the wrong answer. Reliability, not cleverness, is what makes it shippable.
There’s a gap between a demo that wows people in a meeting and a feature that holds up at 2 a.m. when a customer is using it and you’re asleep. I’ve spent the last couple of years shipping LLM-backed features, and almost everything hard about it lives in that gap.
The model is the easy part now. The reliability around it is the job.
Here’s what I’ve learned actually matters once a feature has real users.
Treat the model as an unreliable network call, not a function
A pure function returns the same thing every time. An LLM doesn’t: the same input gives slightly different output, and occasionally completely different output. If your code assumes determinism, it will break in ways that are miserable to reproduce.
So I wrap every model call the way I’d wrap a flaky third-party API:
- Timeouts and retries with backoff. Providers have latency spikes. An endpoint with a p50 of 800 ms can have a p99 of 12 seconds.
- A fallback path. If the call fails or times out, what does the user see? “Try again” is a valid answer. A spinner that never resolves is not.
- Structured output you validate. Don’t trust the shape. Ask for JSON, then parse-and-validate it. If validation fails, retry once, then fall back.
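from pydantic import ValidationError

def extract_fields(text: str) -> Fields | None:
    # Two attempts: the first try plus one retry on invalid output.
    for attempt in range(2):
        raw = call_model(prompt(text), timeout=10)
        try:
            return Fields.model_validate_json(raw)  # pydantic v2
        except ValidationError:
            # Assumes a structlog-style logger that accepts keyword fields.
            log.warning("invalid model output", attempt=attempt, raw=raw)
    return None  # caller decides what to do with a miss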
That return None is the most important line. The model will miss. Your system’s reliability is defined by what happens when it does.
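The timeout, backoff, and fallback bullets can be sketched the same way. This is a minimal illustration, not a definitive implementation: `call_with_retries` and its parameters are hypothetical names, and it assumes the model client raises Python's built-in `TimeoutError` or `ConnectionError`. Real provider SDKs usually raise their own exception types, which you would catch instead.

```python
import random
import time
from typing import Callable, Optional


def call_with_retries(
    call: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `call`, retrying on timeout or connection errors with backoff.

    Returns None once all attempts fail so the caller can show a fallback.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            # Assumption: these built-ins stand in for the SDK's own errors.
            if attempt == max_attempts - 1:
                break
            # Exponential backoff (0.5s, 1s, 2s, ...) plus random jitter
            # so that many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return None
```

A `None` result is the signal to render the fallback path, for example a "Try again" message, rather than leaving a spinner on screen. The `sleep` parameter exists only so tests can run without actually waiting.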
You can’t ship what you can’t measure
The first question anyone should ask about an LLM feature is “how often is it right?” — and most teams can’t answer it. They shipped on vibes.
You rarely have a labeled dataset on day one. That’s fine. You can still measure:
- Build a small eval set by hand. 30–50 real examples with expected outcomes get you surprisingly far. This is an afternoon of work and it changes everything.
- Use an LLM as a judge for fuzzy cases — scoring tone, relevance, or whether an answer is grounded in the source. Calibrate the judge against your own labels first so you trust it.
- Log everything in prod: input, output, latency, which prompt version, and any downstream signal (did the user accept the suggestion, retry, or rage-quit?). Production is the best eval set you’ll ever have — if you capture it.
Without this, every prompt change is a guess. With it, you can say “v4 of the prompt improved grounding from 82% to 91%” — and that’s the sentence that makes the feature shippable.
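As a rough sketch of how small such a harness can be: the code below scores a hand-built eval set and checks how often an LLM judge agrees with human labels. The names `EvalCase`, `pass_rate`, and `judge_agreement` are hypothetical, and exact-match scoring after normalization is an assumption that only suits tasks with a single correct answer. Fuzzy tasks would swap in a judge or a task-specific comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected: str


def pass_rate(cases: list[EvalCase], predict: Callable[[str], str]) -> float:
    """Fraction of cases where the prediction matches the expected outcome."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(
        # Compare case-insensitively, ignoring surrounding whitespace.
        predict(c.input_text).strip().lower() == c.expected.strip().lower()
        for c in cases
    )
    return hits / len(cases)


def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)
```

Running `pass_rate` once per prompt version against the same cases is what turns a prompt change from a guess into a comparison. A low `judge_agreement` score suggests the judge prompt needs work before its scores are worth trusting.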
Prompts are code. Version them like code.
A prompt buried in a string literal that someone edited in a hotfix is a production incident waiting to happen. Treat prompts as versioned artifacts:
- Keep them in source control, not a database someone edits live.
- Tag outputs with the prompt version that produced them, so a regression is traceable.
- Change one variable at a time and re-run your evals. “I tweaked the prompt and also switched models and also changed temperature” tells you nothing when quality moves.
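One way to make the first two bullets concrete, sketched under assumptions: the `Prompt` class, the `SUMMARIZE_V4` constant, its template text, and the log field names are all hypothetical, chosen only for illustration. The idea is that the prompt lives in a module under source control and every logged output carries the version that produced it.

```python
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    template: str

    def render(self, **fields: str) -> str:
        """Fill the template's placeholders with the given fields."""
        return self.template.format(**fields)


# Hypothetical prompt; it changes only through a reviewed commit.
SUMMARIZE_V4 = Prompt(
    name="summarize",
    version="v4",
    template="Summarize the following document in three sentences:\n\n{document}",
)


def log_record(prompt: Prompt, model_input: str, output: str, latency_ms: float) -> str:
    """Build one JSON log line tying an output to its prompt version."""
    return json.dumps(
        {
            "ts": time.time(),
            "prompt_name": prompt.name,
            "prompt_version": prompt.version,
            "input": model_input,
            "output": output,
            "latency_ms": latency_ms,
        }
    )
```

With records shaped like this, a quality regression can be filtered by `prompt_version` to see whether it started with a specific prompt change.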
Cost and latency are product features, not afterthoughts
Two calls chained together feel fine in a demo. Five calls in an agent loop, each waiting on the last, is a 20-second response and a bill that scales linearly with success. Things that have saved me real money and milliseconds:
- Use the smallest model that passes your evals. Don’t default to the biggest one out of caution. Measure, then right-size.
- Cache aggressively. Identical or near-identical inputs are common in real traffic, so normalize them and serve repeats from a cache instead of paying for the same call twice.
- Stream when there’s a human waiting. Perceived latency drops dramatically when text appears token by token, even if total time is unchanged.
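The caching bullet can be sketched as a thin wrapper around the model call. This is an illustration under assumptions, not a production cache: `CachedModel` and `cache_key` are hypothetical names, the store is an in-memory dict with no eviction or expiry, and collapsing whitespace and case is only safe when neither affects the task. The prompt version is part of the key so that a prompt change does not serve stale answers.

```python
import hashlib
from typing import Callable


def cache_key(prompt_version: str, text: str) -> str:
    """Hash the prompt version plus whitespace- and case-normalized input."""
    normalized = " ".join(text.split()).lower()
    payload = f"{prompt_version}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class CachedModel:
    """Wraps a model call with an in-memory cache keyed on normalized input."""

    def __init__(self, call: Callable[[str], str], prompt_version: str) -> None:
        self._call = call
        self._prompt_version = prompt_version
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, text: str) -> str:
        key = cache_key(self._prompt_version, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for the model call once and remember the result.
        self.misses += 1
        result = self._call(text)
        self._store[key] = result
        return result
```

Tracking `hits` and `misses` gives a hit rate, which is the number that tells you whether the cache is earning its complexity on real traffic.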
Design for the wrong answer
This is the mindset shift. Traditional software is correct or it has a bug you fix. LLM features are probabilistically correct, and you’re never driving the error rate to zero. So the design question isn’t “how do I make it always right?” — it’s “what’s the blast radius when it’s wrong?”
- Drafting, not deciding. Let the model propose; let the human (or a deterministic check) dispose. A wrong draft is annoying. A wrong autonomous action is an incident.
- Make corrections cheap. An easy edit button beats a confident wrong answer the user can’t fix.
- Constrain the surface. The narrower the task, the more reliable the system. “Summarize this document” is tractable. “Do anything the user asks” is a research project.
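"Drafting, not deciding" can be expressed as a deterministic gate between the model's proposal and any real action. The sketch below uses a made-up refund scenario: `RefundDraft`, `route_draft`, and the 2,000-cent auto-approval limit are hypothetical and exist only to show the shape of the check.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_APPLY = "auto_apply"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"


@dataclass(frozen=True)
class RefundDraft:
    order_total_cents: int
    proposed_refund_cents: int  # the amount the model proposed


def route_draft(draft: RefundDraft, auto_limit_cents: int = 2_000) -> Route:
    """Decide what happens to a model proposal using only deterministic rules."""
    # Impossible proposals never reach a human or the payment system.
    if draft.proposed_refund_cents <= 0:
        return Route.REJECT
    if draft.proposed_refund_cents > draft.order_total_cents:
        return Route.REJECT
    # Small amounts have a small blast radius, so they can be applied directly.
    if draft.proposed_refund_cents <= auto_limit_cents:
        return Route.AUTO_APPLY
    # Everything else stays a draft until a person approves it.
    return Route.HUMAN_REVIEW
```

The model never calls the payment system itself. It only fills in a `RefundDraft`, and the limit in `route_draft` is the dial that sets how much damage a wrong answer can do.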
The takeaway
Building with LLMs in production is less about the model and more about the engineering you wrap around it: validation, evals, observability, versioning, and honest design for failure. The teams that ship features people trust aren’t the ones with the cleverest prompts. They’re the ones who treated the model as one unreliable component in a system they actually engineered.
That’s the unglamorous part. It’s also the whole job.