How to Evaluate LLM Apps: Evals That Catch Failures Before Production
You can’t assertEquals a language model. That’s why teams ship LLM features blind and find the regressions in production. Evals are the missing discipline — here’s how to build ones that actually catch failures.
Here’s the uncomfortable thing about shipping software powered by a language model: you can’t write the test. There’s no assertEquals for “summarize this ticket” — there are a thousand valid summaries and no single right string. So a lot of teams skip testing entirely, ship on vibes, and find out a prompt tweak broke their extraction logic when a customer does.
That’s not a rare failure; it’s the common one. Surveys in 2026 put it bluntly — a large share of LLM applications hit serious quality regressions within the first ninety days of production, and the through-line is almost always the same: no systematic evaluation. Evals are the discipline that closes that gap. They’re how you turn “seems fine” into something you can measure, gate on, and trust.
Why traditional testing breaks
Conventional tests assume determinism: same input, same output, exact comparison. LLMs violate every part of that, which is why dragging unit-test habits over wholesale produces either brittle tests that fail on harmless wording changes or no tests at all.
- Outputs are non-deterministic. The same prompt yields different text run to run; exact-match assertions are meaningless.
- There are many right answers. Correctness is a spectrum — faithful, relevant, well-formed — not a single expected value.
- Failures are subtle. A response can be fluent, confident, and wrong. Nothing crashes; the quality just quietly drops.
- Behavior drifts with everything. A new model version, a reworded prompt, a changed retrieval source — any of them can move quality without a line of your code changing.
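To make the first two points concrete, here is a small sketch of why an exact-match assertion is the wrong tool. The two ticket summaries and the `mentions_all` helper are made up for illustration; the point is only that both strings are acceptable answers, yet an equality check rejects one of them.

```python
# Two summaries a human would accept for the same ticket (made-up examples).
expected = "Customer cannot log in after password reset."
candidate = "After resetting their password, the customer is unable to log in."


def exact_match(output: str, reference: str) -> bool:
    """The unit-test habit: pass only on an identical string."""
    return output == reference


def mentions_all(output: str, required_terms: list[str]) -> bool:
    """A looser property check: every required term appears, case-insensitively."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in required_terms)


exact_ok = exact_match(candidate, expected)
property_ok = mentions_all(candidate, ["password", "log in"])
print(f"exact match: {exact_ok}, property check: {property_ok}")
```

The property check is still crude, but it fails for the right reason (a missing fact) rather than the wrong one (different wording).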
The four layers of evaluation
Good eval suites aren’t one technique; they’re layers — cheap-and-strict at the bottom, expensive-and-nuanced at the top. You run the cheap ones constantly and the costly ones deliberately.
- Deterministic checks. Is it valid JSON? Does it contain the required field? Is it under the length limit? Fast, free, and catches a surprising number of real breaks.
- Heuristic scoring. Cheap signals — keyword presence, regex matches, embedding similarity to a reference. Rough, but good enough to flag big regressions.
- LLM-as-judge. Use a model to score outputs on faithfulness, relevance, or tone against a rubric. This handles the nuance traditional metrics miss — and it’s what most modern eval tooling is built around.
- Human review. The calibration layer. Periodically have people score a sample to confirm your automated judges still agree with actual humans.
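The two bottom layers are cheap enough to sketch with the standard library alone. This is one possible shape, not a reference implementation: the function names, the equal 0.5 weights, and the sample strings are all assumptions, and `difflib.SequenceMatcher` stands in for embedding similarity so the example runs without a model.

```python
import json
import re
from difflib import SequenceMatcher


def deterministic_checks(raw: str, required_field: str, max_chars: int) -> list[str]:
    """Layer 1: strict, free checks. Returns failure reasons (empty list = pass)."""
    failures = []
    if len(raw) > max_chars:
        failures.append(f"longer than {max_chars} characters")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        failures.append("not valid JSON")
        return failures
    if not isinstance(data, dict) or required_field not in data:
        failures.append(f"missing field {required_field!r}")
    return failures


def heuristic_score(text: str, reference: str, pattern: str) -> float:
    """Layer 2: a rough 0..1 signal from a regex hit plus string similarity.

    SequenceMatcher is a standard-library stand-in for embedding similarity,
    and the 0.5/0.5 weighting is an arbitrary choice for this sketch.
    """
    regex_hit = 1.0 if re.search(pattern, text, re.IGNORECASE) else 0.0
    similarity = SequenceMatcher(None, text.lower(), reference.lower()).ratio()
    return 0.5 * regex_hit + 0.5 * similarity


good = '{"summary": "Customer needs a password reset link."}'
bad = "Sure! Here is the summary: customer needs help."

print(deterministic_checks(good, "summary", max_chars=200))
print(deterministic_checks(bad, "summary", max_chars=200))
print(heuristic_score("We sent you a reset link.", "A reset link was emailed to you.", r"reset link"))
```

Returning a list of reasons instead of a bare boolean keeps the report readable when a build fails: you see which check broke, not just that one did.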
A minimal eval in code
You don’t need a platform to start. The core of an eval is a dataset of inputs with expectations and a function that scores the output. Here’s the shape of an LLM-as-judge check — a golden dataset, your app’s output, and a model grading the result against a rubric.
    # eval.py - the essence of an LLM-as-judge eval.
    # call_model() and my_app() are placeholders: wire them to your judge
    # model and to the system under test before running this.

    # A golden dataset: inputs plus what a good answer must contain.
    golden = [
        {"input": "Reset my password", "must_mention": "reset link"},
        {"input": "Cancel my order", "must_mention": "refund"},
    ]

    def judge(answer: str, must_mention: str) -> bool:
        prompt = f"Does this reply mention {must_mention}? yes or no. Reply: {answer}"
        verdict = call_model(prompt)  # your judge model
        return verdict.strip().lower().startswith("yes")

    passed = 0
    for case in golden:
        answer = my_app(case["input"])  # the system under test
        if judge(answer, case["must_mention"]):
            passed += 1

    print(f"score: {passed}/{len(golden)}")

That’s the whole idea, minus the polish. Real tooling (DeepEval, Promptfoo, and friends) adds metrics, parallelism, and reporting, but the heart is exactly this: a curated set of cases and a repeatable score. Start with twenty hand-picked examples that represent your real traffic; a small, honest dataset beats a giant noisy one.
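The check above asks a yes/no question. Scoring against a rubric, as described earlier, mostly changes the prompt and the parsing. Here is a hedged sketch: the rubric wording, the 1 to 5 scale, and every function name are assumptions for illustration, and the judge model is passed in as a plain callable so a fake can stand in for it while you test the plumbing.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; write your own for the quality you care about.
RUBRIC = (
    "Score the reply from 1 to 5 for faithfulness to the source.\n"
    "5 = every claim is supported by the source, 1 = mostly unsupported.\n"
    "Answer with the number only."
)


def build_prompt(source: str, reply: str) -> str:
    """Assemble the judge prompt from the rubric, the source, and the reply."""
    return f"{RUBRIC}\n\nSource:\n{source}\n\nReply:\n{reply}\n\nScore:"


def parse_score(verdict: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the verdict, or None if absent."""
    match = re.search(r"\b[1-5]\b", verdict)
    return int(match.group()) if match else None


def judge_faithfulness(
    source: str, reply: str, call_model: Callable[[str], str]
) -> Optional[int]:
    """Ask the judge model for a rubric score and parse it defensively."""
    return parse_score(call_model(build_prompt(source, reply)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "Score: 4"


score = judge_faithfulness(
    "The order shipped on Monday.", "Your order went out Monday.", fake_judge
)
print(score)
```

Treating an unparseable verdict as `None` rather than as a low score keeps judge failures separate from app failures in your report.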
An eval suite is just a regression test that admits it can’t be exact. The teams who trust their LLM features aren’t smarter — they just refused to ship without a number.
Make it a CI gate
An eval you run by hand when you remember is theater. The value shows up when it runs automatically and can block a bad change before it merges.
- Curate a golden dataset drawn from real inputs, with a clear expectation for each. Twenty examples is enough to start; aim for 50-200 as it grows.
- Wire the evals into CI so every pull request that touches a prompt, model, or retrieval path runs them.
- Set a threshold, say a pass rate of at least 90% on the suite, and fail the build if the score drops below it.
- Sample production traffic and score it continuously; offline evals miss what real users actually send.
- Feed every production failure back into the golden dataset, so the same bug can never ship twice.
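The gate itself is a few lines. The sketch below assumes each golden case has already been scored to a pass/fail boolean and that the golden dataset lives in a JSON Lines file; the function names, the file layout, and the reuse of the `must_mention` field are illustrative choices, not a prescribed format.

```python
import json
from pathlib import Path

THRESHOLD = 0.90  # the example figure from the list above; tune it to your suite


def pass_rate(results: list[bool]) -> float:
    """Fraction of golden cases that passed; an empty suite counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def gate(results: list[bool], threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 0 lets the build through, 1 fails it."""
    rate = pass_rate(results)
    print(f"eval pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


def add_failure(path: Path, user_input: str, expectation: str) -> None:
    """Append a production failure to the golden file as one JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"input": user_input, "must_mention": expectation}) + "\n")


# In a real run these booleans come from scoring every golden case.
exit_code = gate([True] * 9 + [False])
# In CI, finish the script with: raise SystemExit(exit_code)
```

Counting an empty suite as a failure is deliberate: a gate that passes because the dataset failed to load is worse than no gate.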
Where teams get it wrong
- Waiting for the perfect dataset. Twenty real examples today beat a thousand imagined ones next quarter. Start small and grow it from failures.
- Trusting the judge blindly. LLM judges have biases: they tend to favor longer answers and text written in their own style. Calibrate against human scores periodically.
- Only testing the happy path. Your real risk is the weird input, the adversarial prompt, the empty result. Put those in the dataset on purpose.
- Evaluating offline and never in production. The distribution of real traffic always surprises you; monitor live outputs, not just your test set.
- Measuring nothing because you can’t measure everything. A rough score that runs every day beats a perfect framework you never finish.
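Calibrating the judge can also start small. One way to do it, sketched here under the assumption that a person and the judge have both labeled the same sample as pass or fail, is to compute the raw agreement rate and Cohen's kappa, which discounts the agreement you would expect by chance. The labels below are invented for the example.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Share of cases where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance, for two raters with binary labels."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    p_judge = sum(judge) / n  # how often the judge says "pass"
    p_human = sum(human) / n  # how often the human says "pass"
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels: the judge passes two replies the human reviewer rejected.
judge_labels = [True, True, True, False, True, False, True, True]
human_labels = [True, True, False, False, True, False, True, False]

agreement = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
print(f"agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

A judge that disagrees with people mostly in one direction, as in this sample, is a hint to look for a systematic bias such as a preference for longer answers.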
Evals don’t make a language model deterministic, and they don’t have to. They make its behavior observable — they turn an opinion about quality into a number you can watch over time and defend in a review. That’s the difference between a team that hopes its AI feature still works and one that knows. Write the twenty examples. Run them on every change. You’ll catch the regression before your users do.