
Evals: How to Actually Know Your AI Works

Vibes-testing lies to you. Here's how I build eval sets, grade outputs, and run regression tests so I know a model change didn't quietly break things.

By Matt Goren · Updated June 25, 2026 · 8 min read

I have shipped AI features that demoed beautifully and then fell apart the first week real people touched them. Every time, the root cause was the same: I trusted my own eyeballs. I typed three prompts I already knew the answers to, the model nailed them, and I called it done. That is vibes-testing, and it is the single most expensive habit in building with language models.

The fix is evals. An eval is just a saved set of real inputs paired with a way to score the output. It is the difference between "this feels good" and "this passes 47 of 50 cases, and here are the 3 it fails." Once you have that number, every prompt edit and every model swap becomes a measurable decision instead of a prayer. This is the discipline that lets me move fast without breaking the thing quietly.

Why vibes-testing fails

Vibes-testing fails for reasons that are baked into how these models work, not because you are careless.

First, language models are non-deterministic. The same prompt can produce a slightly different answer on two runs. When you test once and it works, you have learned almost nothing about the next thousand runs.

Second, you test the happy path. You naturally reach for inputs you understand, which are the inputs the model handles well. The cases that actually hurt you in production are the weird ones: the empty field, the sarcastic customer, the 4,000-word pasted email, the question in Spanish. Those never show up in a casual hand-test.

Third, you have no memory. You change a prompt to fix one problem, eyeball two examples, and ship. You have no idea that your fix broke a category of inputs that worked yesterday. Without a saved set of cases, every change is a coin flip, and the coin is invisible.

The whole point of evals is to replace your fallible attention with a repeatable, written test that does not get tired, does not get optimistic, and does not forget.

Build the eval set from real cases

Your eval set is the most valuable asset in the whole system. Spend your effort here, not on clever grading code.

1. Pull cases from reality, not your imagination

The best cases come from things that actually happened. Mine your logs, support tickets, and the screenshots people send you when something looks wrong. If you are pre-launch and have no usage yet, write cases that mirror the real distribution of work you expect, then replace them with genuine examples the moment traffic arrives.

I aim for twenty to fifty cases to start. That sounds small, but fifty real, well-chosen cases will teach you more than a thousand synthetic ones. Quality and coverage beat raw count.

2. Weight toward failures and edges

Do not build a set of easy wins. Deliberately overweight the cases that scare you: the ambiguous request, the adversarial input, the one with a typo in the part that matters, the one where the right answer is "I don't know." These are where regressions hide. A test suite made of softballs passes every time and protects you from nothing.

3. Write down the expected behavior

For each case, record what a good answer looks like. Sometimes that is an exact value ("the extracted price is 49.99"). Sometimes it is a set of must-haves ("mentions the refund window, stays under 80 words, does not promise a callback"). Sometimes it is just a rubric for a human or a judge to apply. Whatever form it takes, write it down before you look at the model's output, so you are grading against a standard rather than rationalizing whatever came back.

4. Keep it in version control

Store cases as plain files (JSON, YAML, CSV) next to your code. When production surfaces a new failure, the fix is not just a prompt edit; it is a new eval case so that exact failure can never silently return. Your set grows every time the world surprises you.
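
As a rough sketch of what one of those files could hold: the field names below (`id`, `tags`, `expected`, `kind`) and the `loadCases` helper are my own illustration, not a standard format, and the two cases are made up. The JSON is inlined as a string so the example runs without touching the filesystem.

```typescript
// Hypothetical case shape. Every field name here is an assumption for illustration.
type Expected =
  | { kind: "exact"; value: string }                            // one correct answer
  | { kind: "mustHave"; required: string[]; maxWords?: number } // a set of must-haves
  | { kind: "rubric"; criteria: string[] };                     // for a human or a judge

interface EvalCase {
  id: string;
  input: string;      // may legitimately be empty: the empty field is an edge case
  tags?: string[];    // e.g. "edge", "adversarial", "prod-failure"
  expected: Expected; // written down before looking at any model output
}

// What a cases file might contain (made-up examples).
const casesJson = `[
  {"id": "price-001", "input": "Total due: 49.99 USD", "tags": ["extraction"],
   "expected": {"kind": "exact", "value": "49.99"}},
  {"id": "refund-014", "input": "I want my money back. Again.", "tags": ["edge", "prod-failure"],
   "expected": {"kind": "mustHave", "required": ["refund window"], "maxWords": 80}}
]`;

// Parse and sanity-check so a malformed or duplicated case fails loudly at load time.
function loadCases(json: string): EvalCase[] {
  const parsed = JSON.parse(json) as EvalCase[];
  const seen = new Set<string>();
  for (const c of parsed) {
    if (typeof c.id !== "string" || typeof c.input !== "string" || !c.expected) {
      throw new Error(`Malformed case: ${JSON.stringify(c)}`);
    }
    if (seen.has(c.id)) throw new Error(`Duplicate case id: ${c.id}`);
    seen.add(c.id);
  }
  return parsed;
}

const cases = loadCases(casesJson);
```

Stable ids matter more than the exact shape: they are what lets you say which case flipped between two runs.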

Choose a grading method

How you score an output depends on what kind of output it is. I use three methods, and most real systems mix all three.

Exact-match and programmatic checks

When there is a single correct answer, just compare. Classification labels, extracted fields, structured JSON, yes/no decisions: these you can check with code. Did it return "refund"? Is the price field a number? Does the JSON parse and match the schema? This is the cheapest, most reliable grading you can do, so push as much of your output into checkable shapes as you reasonably can.

You can also do programmatic checks that are not exact-match: length limits, required substrings, "must not contain a phone number," "must be valid JSON." These catch a surprising amount.
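
A minimal sketch of what these checks can look like in code. The helper names are mine, and the phone-number pattern is a deliberately rough US-style regex for illustration, not something to rely on for real PII detection.

```typescript
// A check takes the model's raw output and returns pass/fail.
type Check = (output: string) => boolean;

// Exact match after trimming whitespace, e.g. a classification label.
const exactMatch = (expected: string): Check => (out) => out.trim() === expected;

// Length limit in words.
const maxWords = (limit: number): Check => (out) =>
  out.trim().split(/\s+/).filter(Boolean).length <= limit;

// Required substring, case-insensitive.
const contains = (needle: string): Check => (out) =>
  out.toLowerCase().includes(needle.toLowerCase());

// Output must parse as JSON.
const isValidJson: Check = (out) => {
  try { JSON.parse(out); return true; } catch { return false; }
};

// Output must be a JSON object whose named field is a number.
const hasNumericField = (field: string): Check => (out) => {
  try {
    const value = (JSON.parse(out) as Record<string, unknown>)[field];
    return typeof value === "number";
  } catch { return false; }
};

// Rough, illustrative pattern for 10-digit US-style phone numbers.
const noPhoneNumber: Check = (out) => !/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/.test(out);

// Run a named set of checks and return the names of the ones that failed.
function failedChecks(output: string, checks: Record<string, Check>): string[] {
  return Object.entries(checks)
    .filter(([, check]) => !check(output))
    .map(([name]) => name);
}

// Usage with a made-up support reply.
const replyChecks: Record<string, Check> = {
  "under 80 words": maxWords(80),
  "mentions refund window": contains("refund window"),
  "no phone number": noPhoneNumber,
};
const failures = failedChecks("Our refund window is 30 days. Call 555-123-4567.", replyChecks);
```

Returning the names of failed checks, rather than a single boolean, makes it obvious why a case failed when you read the report later.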

Rubric grading

For open-ended output, write a rubric: a short list of criteria, each independently checkable. For a support reply, mine might be: addresses the actual question, cites the correct policy, stays professional, under 100 words, no invented facts. Each criterion is pass or fail. The score is how many it hits. Rubrics turn "is this a good answer?" into several small, answerable questions, which is the only way to grade fuzzy output consistently.
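
Scoring a rubric is just counting. In this sketch the type and function names are my own, the reply is made up, and most verdicts are hand-entered to stand in for a human or judge; only the word-count criterion is computed.

```typescript
interface CriterionResult {
  name: string;
  passed: boolean;
}

// Score = number of criteria met; keep the failed names so the report is readable.
function rubricScore(results: CriterionResult[]): { passed: number; total: number; failed: string[] } {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  return { passed: results.length - failed.length, total: results.length, failed };
}

const wordCount = (text: string): number => text.trim().split(/\s+/).filter(Boolean).length;

// Made-up reply for illustration.
const reply =
  "You are inside the refund window, so I have started the refund. It should arrive within a week.";

const rubricResults: CriterionResult[] = [
  { name: "addresses the actual question", passed: true }, // hand-entered verdict
  { name: "cites the correct policy", passed: true },      // hand-entered verdict
  { name: "stays professional", passed: true },            // hand-entered verdict
  { name: "under 100 words", passed: wordCount(reply) < 100 }, // computed
  // Hand-entered: suppose the "within a week" estimate is not in the policy.
  { name: "no invented facts", passed: false },
];

const rubricResult = rubricScore(rubricResults);
```

Here the reply meets four of five criteria, and the one it misses is named, which is the part you act on.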

LLM-as-judge

Tone, helpfulness, and faithfulness are hard to measure with plain code. For those, use a second model as the grader. You hand the judge the input, the output, and your rubric, and it returns a score with reasoning. The model doing the work and the model grading it are kept separate, which keeps the grader honest.

Here is the shape of a judge call with the official SDK, using a strong model and structured output so I get back a clean score I can compute on:

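// judgePrompt, input, output, and rubric are assumed to be defined elsewhere.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const res = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  // Structured output: ask for a reply that matches this JSON schema.
  output_config: {
    format: {
      type: "json_schema",
      schema: {
        type: "object",
        properties: {
          faithful: { type: "boolean" },
          score: { type: "integer" },
          reason: { type: "string" },
        },
        required: ["faithful", "score", "reason"],
        additionalProperties: false,
      },
    },
  },
  // The judge sees the original input, the output being graded, and the rubric.
  messages: [{ role: "user", content: judgePrompt(input, output, rubric) }],
});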
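
The call above leans on a `judgePrompt` helper that is not shown. Here is one hypothetical way to write it, assuming the rubric is a list of criterion strings; the wording and the tag-style delimiters are my own choices, not a required format.

```typescript
// Hypothetical helper: builds the grading prompt from the case input, the model's
// answer, and the rubric. The rubric type (string[]) is an assumption.
function judgePrompt(input: string, output: string, rubric: string[]): string {
  const criteria = rubric.map((criterion, i) => `${i + 1}. ${criterion}`).join("\n");
  return [
    "You are grading a model's answer against a rubric.",
    "Grade only what the rubric asks for.",
    "",
    "<input>",
    input,
    "</input>",
    "",
    "<answer>",
    output,
    "</answer>",
    "",
    "<rubric>",
    criteria,
    "</rubric>",
    "",
    `Set "score" to the number of criteria met (0 to ${rubric.length}).`,
    'Set "faithful" to false if the answer states anything the input does not support.',
    'Set "reason" to one or two sentences explaining the grade.',
  ].join("\n");
}

// Usage with made-up data.
const examplePrompt = judgePrompt(
  "Can I still return the shoes I bought last week?",
  "Yes, you are inside the refund window.",
  ["addresses the actual question", "no invented facts"],
);
```

Delimiting the input and the answer keeps the judge from confusing the text it is grading with its own instructions.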

The judge is itself a thing you have to trust, so verify it. Before you rely on judge scores at scale, grade thirty cases by hand and compare. If the judge agrees with you most of the time, use it. If it does not, fix the rubric or the judge prompt until it does. An unverified judge is just a confident stranger.
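
The comparison itself is a few lines. This sketch assumes you have reduced both your hand grade and the judge's grade to pass/fail per case; the names are illustrative and the rows are made up.

```typescript
interface GradedCase {
  id: string;
  human: boolean; // your hand grade
  judge: boolean; // the judge's grade
}

// Agreement rate plus the direction of each disagreement.
function judgeAgreement(rows: GradedCase[]): {
  rate: number;
  judgeTooLenient: string[];
  judgeTooStrict: string[];
} {
  const disagreements = rows.filter((r) => r.human !== r.judge);
  return {
    rate: rows.length === 0 ? 0 : (rows.length - disagreements.length) / rows.length,
    judgeTooLenient: disagreements.filter((r) => r.judge && !r.human).map((r) => r.id),
    judgeTooStrict: disagreements.filter((r) => !r.judge && r.human).map((r) => r.id),
  };
}

// Made-up grades for four cases.
const agreementReport = judgeAgreement([
  { id: "case-01", human: true, judge: true },
  { id: "case-02", human: false, judge: true },
  { id: "case-03", human: false, judge: false },
  { id: "case-04", human: true, judge: true },
]);
```

The direction matters as much as the rate: a judge that is too lenient hides regressions, while one that is too strict mostly wastes your time.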

The judge-panel pattern

A single judge has opinions and blind spots, the same as a single human reviewer. For decisions that matter, I run a panel: several judge calls, sometimes with different rubric framings or a higher reasoning effort, and I take the majority or the average. Disagreement among judges is signal in itself. When the panel splits, that case is genuinely ambiguous and deserves a human look. A panel costs more tokens, so I reserve it for high-stakes grading and use a single judge for the everyday pass.
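
One way to combine panel votes, as a sketch. The names are mine, and two policies here are assumptions rather than rules: ties count as a fail, and any non-unanimous panel is flagged for a human look.

```typescript
interface JudgeVote {
  pass: boolean;
  score: number;
}

interface PanelResult {
  pass: boolean;             // strict majority of pass votes
  meanScore: number;         // average of the judges' scores
  needsHumanReview: boolean; // true whenever the judges disagree
}

function combinePanel(votes: JudgeVote[]): PanelResult {
  if (votes.length === 0) throw new Error("Panel needs at least one vote");
  const passVotes = votes.filter((v) => v.pass).length;
  const meanScore = votes.reduce((sum, v) => sum + v.score, 0) / votes.length;
  return {
    pass: passVotes * 2 > votes.length, // assumption: ties count as fail
    meanScore,
    needsHumanReview: passVotes !== 0 && passVotes !== votes.length,
  };
}

// Made-up votes from three judge calls.
const panel = combinePanel([
  { pass: true, score: 4 },
  { pass: true, score: 5 },
  { pass: false, score: 3 },
]);
```

An odd panel size avoids ties altogether, which is why I would usually start with three.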

Regression-test prompts and models

This is the payoff. Once you have a scored eval set, every change runs through it.

Changed a prompt? Run the suite. If the pass count goes from 46 to 48, great. If it goes from 46 to 39, you just broke something; go look at which cases flipped. Considering a model swap, say from Opus to the cheaper Sonnet or Haiku to save money, or up to Fable for the hardest reasoning? Run the same suite on both and compare. Now "can we use the cheaper model" is a number, not an argument. I keep a known-good baseline score for each route so I can always tell whether today's run is better or worse than the last one I trusted.
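
Finding which cases flipped is a small diff over two runs. This sketch assumes each run is stored as a map from case id to pass/fail; the names and the sample data are illustrative.

```typescript
// One run of the suite: case id -> did it pass?
type RunResults = Record<string, boolean>;

function compareRuns(baseline: RunResults, candidate: RunResults): {
  total: number;
  baselinePassed: number;
  candidatePassed: number;
  regressed: string[];
  fixed: string[];
} {
  // Only compare cases present in both runs, so the set under test is fixed.
  const ids = Object.keys(baseline).filter((id) => id in candidate);
  const countPassed = (run: RunResults): number => ids.filter((id) => run[id]).length;
  return {
    total: ids.length,
    baselinePassed: countPassed(baseline),
    candidatePassed: countPassed(candidate),
    regressed: ids.filter((id) => baseline[id] && !candidate[id]), // passed before, fails now
    fixed: ids.filter((id) => !baseline[id] && candidate[id]),     // failed before, passes now
  };
}

// Made-up runs: same pass count, but not the same cases.
const diff = compareRuns(
  { "case-a": true, "case-b": true, "case-c": false, "case-d": true },
  { "case-a": true, "case-b": false, "case-c": true, "case-d": true },
);
```

In this example both runs pass three of four cases, yet one case regressed and another was fixed. The headline number alone can hide that, so I look at the flipped list even when the score goes up.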

The thing that makes this work is that the eval set is fixed. You are changing one variable, the prompt or the model, and watching the score move. That is an actual experiment.

Put evals in CI

The last step is to stop relying on yourself to remember to run them. Wire the eval suite into your pipeline.

On every pull request, run a fast, cheap subset as a gate, maybe ten to twenty cases that cover your core flows. It should finish in under a minute and block the merge if the pass rate drops below your threshold. Nightly, run the full suite, including the slow and expensive judge-panel cases, and post the score somewhere you will see it.
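
The gate itself can be a tiny function. In this sketch the names are mine and the 0.9 threshold in the usage line is an arbitrary example; failing closed when no cases ran is my assumption about the safer default.

```typescript
interface GateDecision {
  passRate: number;
  ok: boolean;
  message: string;
}

// Decide whether a pull request may merge, given pass/fail per case.
function ciGate(results: boolean[], threshold: number): GateDecision {
  if (results.length === 0) {
    // Assumption: an empty run is treated as a failure, not a pass.
    return { passRate: 0, ok: false, message: "No eval cases ran; failing closed." };
  }
  const passed = results.filter(Boolean).length;
  const passRate = passed / results.length;
  const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;
  return {
    passRate,
    ok: passRate >= threshold,
    message: `${passed}/${results.length} passed (${pct(passRate)}), threshold ${pct(threshold)}`,
  };
}

// Made-up run: 19 of 20 cases pass against an example threshold of 0.9.
const gate = ciGate([...Array(19).fill(true), false], 0.9);
```

In a Node-based CI script you would then exit non-zero when `gate.ok` is false (for instance by setting `process.exitCode = 1`), which is what actually blocks the merge.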

The rule I hold myself to: a prompt or model change does not ship until the evals are green. Treat the eval suite exactly like unit tests, because for an AI feature that is precisely what it is. The model is a dependency you do not control, and it can change underneath you; your evals are the alarm that goes off when it does.

None of this is glamorous. Building the case set is tedious, writing rubrics is fiddly, and verifying the judge feels like busywork until the day it catches a silent regression that would have shipped to every user. That day pays for all of it. If you want the upstream practices that make your outputs gradeable in the first place, see my notes on prompt engineering for production, and for how evals fit into the larger build, the building with LLMs field guide.

FAQ

What is an eval, in plain terms? An eval is a saved set of real inputs plus a way to score the model's output on each one. It turns "seems fine" into a number you can track across prompt and model changes.

How many eval cases do I need to start? Twenty to fifty real cases beat a thousand invented ones. Pull them from actual usage, weight them toward the failures and edge cases that scared you, and grow the set every time something breaks in production.

Is LLM-as-judge reliable enough to trust? For fuzzy qualities like tone or helpfulness, yes, if the judge is a strong model working from a written rubric and a few graded examples. Spot-check the judge against your own ratings before you trust its scores at scale.

Do I need evals if I'm just writing prompts, not training models? Especially then. Prompt edits and model swaps are the most common sources of silent regressions. An eval set is the only thing that tells you a "small tweak" didn't tank a case you forgot about.

How do evals fit into CI? Run a fast, cheap subset on every pull request as a gate, and the full suite nightly. Fail the build when the pass rate drops below a threshold you set from a known-good baseline.
