How I Evaluate LLM Output Without a Ground-Truth Dataset

TL;DR

You don't need a labeled dataset to measure an AI feature. Hand-build 30 examples, assert directly on verifiable outputs, use a calibrated LLM-as-judge for the subjective cases, and harvest production signals once you're live. Make evaluation cheap instead of skipping it.

The honest starting position for most AI features is this: you have a prompt, a vague sense that it “seems to work,” and zero labeled data telling you how often it’s actually right. No answer key. And yet someone needs to decide whether to ship it.

Waiting for a perfectly labeled dataset is how features sit in limbo for months. You don’t need one. Here’s the progression I use to get from “seems fine” to a real number.

Start with 30 examples you label yourself

The single highest-leverage thing you can do is sit down for an afternoon and build a tiny eval set by hand. Collect 30–50 realistic inputs — pulled from actual usage if you have it, invented if you don’t — and write down what a good output looks like for each.

This feels too small to matter. It isn’t. Thirty examples will surface your most common failure modes immediately, and they give you something a vibe never will: a number that moves when you change something. “8 of 30 failed” is far more useful than “it seems a bit off sometimes.” Just remember the resolution: at this size one example is worth about three percentage points, so treat a shift of one or two cases as noise.

Keep these in version control next to the code. They’re a test suite, not a one-off.
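Here’s a minimal sketch of what “a test suite, not a one-off” could look like. It assumes the cases live in a JSONL file with `id`, `input`, and `expected` fields; that layout, and the names `load_cases`, `run_eval`, `generate`, and `check`, are my own illustration rather than a standard.

```python
import json
from typing import Callable

def load_cases(path: str) -> list[dict]:
    """Read one JSON object per line (JSONL), skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def run_eval(
    cases: list[dict],
    generate: Callable[[str], str],
    check: Callable[[str, dict], bool],
) -> dict:
    """Run every case through the model and collect the ids that fail."""
    failed_ids = [c["id"] for c in cases if not check(generate(c["input"]), c)]
    return {"total": len(cases), "failed": len(failed_ids), "failed_ids": failed_ids}

if __name__ == "__main__":
    # Stand-in model and check so the sketch runs without an API key.
    cases = [
        {"id": "greet-1", "input": "hello", "expected": "HELLO"},
        {"id": "greet-2", "input": "bye", "expected": "Bye"},
    ]
    report = run_eval(cases, generate=str.upper, check=lambda out, c: out == c["expected"])
    print(f'{report["failed"]} of {report["total"]} failed: {report["failed_ids"]}')
```

In real use, `generate` would wrap your model call and `check` would be whichever check fits the task. Returning the failed ids matters as much as the count, because the ids tell you which cases to go read.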

Match the check to the task

Not every output needs the same kind of evaluation. Sort your task into one of three buckets:

Verifiable output. If the model produces something with a checkable property — valid JSON, a query that runs, code that compiles, an answer in an allowed set — then assert on that property directly. No judgment needed. Just code.

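import json

def check(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Valid JSON may be a list or number, so confirm the shape before indexing.
    summary = data.get("summary") if isinstance(data, dict) else None
    return isinstance(summary, str) and len(summary) <= 280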

Reference-based. If you have an expected answer, compare against it — but rarely with exact string match (the model will phrase things differently and still be right). Check for key facts being present, or use embedding similarity to flag answers that drifted far from the reference.
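A rough sketch of the key-facts version, assuming you’ve written a short list of must-have phrases for each example. The helper names `normalize` and `missing_facts` are hypothetical, and the normalization only handles ASCII text. It won’t catch paraphrases (“a month” for “30 days”), which is where embedding similarity or a judge would take over.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and turn every run of non-alphanumeric ASCII into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def missing_facts(output: str, key_facts: list[str]) -> list[str]:
    """Return the key facts that do not appear in the output as whole phrases."""
    # Pad with spaces so "30 days" does not match inside "130 days".
    haystack = f" {normalize(output)} "
    return [fact for fact in key_facts if f" {normalize(fact)} " not in haystack]

if __name__ == "__main__":
    answer = "You can return it within 30 days of delivery."
    print(missing_facts(answer, ["30 days", "delivery", "receipt"]))
```

An empty list means the check passes. Returning the missing facts instead of a bare boolean makes each failure self-explanatory when you read the report.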

Open-ended. Summaries, rewrites, explanations — no single right answer. This is where most people give up on measurement. Don’t. This is what LLM-as-judge is for.

Use an LLM as a judge — but calibrate it first

For subjective qualities — is this summary faithful to the source? is this answer grounded, or did it make something up? is the tone right? — you can have a model score the output. It works far better than people expect, if you do two things:

Score one specific dimension at a time with a concrete rubric. “Rate 1–5 whether every claim in the summary is supported by the source text” beats “rate the quality.” Vague criteria produce vague scores.
Calibrate the judge against your own labels. Take the 30 examples you labeled by hand, run the judge on them, and check whether it agrees with you. If it doesn’t, fix the rubric until it does. An uncalibrated judge is just a second opinion you haven’t earned the right to trust.
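To make “check whether it agrees with you” concrete, here’s one way to score it, assuming both you and the judge produce a categorical label such as pass/fail for each example. The function name and return shape are my own. Cohen’s kappa is included as one common agreement statistic because raw agreement looks flattering when most examples pass.

```python
from collections import Counter

def agreement(human: list[str], judge: list[str]) -> dict:
    """Compare judge labels to human labels on the same examples, in order."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    # If both raters used one identical label throughout, kappa is undefined;
    # this sketch reports 1.0 in that case.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    return {"observed": observed, "kappa": kappa, "disagreements": disagreements}

if __name__ == "__main__":
    human = ["pass", "pass", "fail", "fail"]
    judge = ["pass", "fail", "fail", "fail"]
    print(agreement(human, judge))
```

The `disagreements` indices are the useful part in practice: read those examples, decide whether you or the judge was wrong, and adjust the rubric or your label accordingly.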

The trap to avoid: judges are biased. They favor longer answers, prefer their own writing style, and are swayed by confident tone over correctness. Calibration against human labels is what catches this.
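One quick probe for the length bias, sketched under the assumption that you have a human score and a judge score on the same numeric scale for each labeled example. The row fields and the `length_bias` name are illustrative. It splits outputs at the median length and compares how much the judge over-scores each half.

```python
from statistics import mean, median

def length_bias(rows: list[dict]) -> dict:
    """Compare the judge-minus-human score gap on short vs. long outputs.

    Each row needs "output", "human_score", and "judge_score".
    """
    if not rows:
        raise ValueError("rows must not be empty")
    cutoff = median(len(r["output"]) for r in rows)
    short = [r["judge_score"] - r["human_score"] for r in rows if len(r["output"]) <= cutoff]
    long_ = [r["judge_score"] - r["human_score"] for r in rows if len(r["output"]) > cutoff]
    return {
        "cutoff_chars": cutoff,
        "short_gap": mean(short),
        # No "long" group exists when every output has the same length.
        "long_gap": mean(long_) if long_ else None,
    }
```

A `long_gap` clearly above `short_gap` suggests the judge is rewarding length rather than quality. With only 30 examples this is a smell test, not a statistic, so use it to decide which outputs to reread.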

Production is the eval set you’ve been waiting for

Once the feature is live, you’re sitting on the best dataset you’ll ever have, if you capture it. Log every input, output, prompt version, and any downstream signal: did the user accept the suggestion, edit it, retry, or abandon it? Those implicit signals are a continuous, nearly free measure of quality on real traffic. They’re noisy proxies (an accepted suggestion isn’t always a correct one), so read them as a trend line rather than a verdict.
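As a sketch of the logging side, here’s one possible record shape and a per-prompt-version rollup. The `Interaction` fields and the four outcome strings mirror the signals listed above but are otherwise my own naming, and a real system would write these records to whatever store you already use.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    prompt_version: str
    input_text: str
    output_text: str
    outcome: str  # assumed values: "accepted", "edited", "retried", "abandoned"

def outcome_rates(events: list[Interaction]) -> dict[str, dict[str, float]]:
    """Share of each outcome per prompt version."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for e in events:
        counts[e.prompt_version][e.outcome] += 1
    return {
        version: {outcome: c / sum(per.values()) for outcome, c in per.items()}
        for version, per in counts.items()
    }

if __name__ == "__main__":
    events = [
        Interaction("v1", "q1", "a1", "accepted"),
        Interaction("v1", "q2", "a2", "edited"),
        Interaction("v2", "q3", "a3", "accepted"),
    ]
    print(outcome_rates(events))
```

Keying everything by prompt version is the point of the design: it lets you compare two prompts on the same signal instead of guessing which change moved the number.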

Periodically pull the failures and the edited outputs into your hand-labeled set. It grows on its own, and it stays representative of what users actually do — which your invented examples never quite are.
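The harvesting step can be as small as the sketch below. It assumes each logged event is a dict with `input`, `output`, and `outcome` keys (hypothetical field names) and leaves `expected` empty, because a human still has to write down what good looks like.

```python
def harvest_cases(events: list[dict], existing_inputs: set[str]) -> list[dict]:
    """Turn non-accepted production events into unlabeled eval candidates."""
    review_outcomes = {"edited", "retried", "abandoned"}
    seen = set(existing_inputs)  # copy so the caller's set is not mutated
    new_cases = []
    for e in events:
        if e["outcome"] in review_outcomes and e["input"] not in seen:
            seen.add(e["input"])
            new_cases.append({
                "input": e["input"],
                "model_output": e["output"],
                "expected": None,  # to be filled in by hand
                "source": "production",
            })
    return new_cases
```

Deduplicating on the input keeps one popular query from swamping the set. Whether exact-match deduplication is enough depends on how varied your traffic is.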

The takeaway

“No labeled data” is not a reason to ship on faith. Build 30 examples by hand, match the check to the task, use a calibrated judge for the fuzzy stuff, and harvest production signals once you’re live. None of it requires a research team or a labeling budget — just the decision to measure instead of guess.

The teams that move fast with AI aren’t the ones who skip evaluation. They’re the ones who made it cheap.