
How I Evaluate LLM Apps: Evals From a Cited Document Q&A App (and One I Didn’t Evaluate)

You can't assertEquals a language model. Building DocQA, a Claude RAG app that cites its sources, taught me which failures a unit test still catches, which need real evals -- and why my other project, SelfMind, has none and shouldn't be trusted yet.

Dhileep Kumar · Jun 10, 2026 · 7 min read

You can't assertEquals a language model. I learned exactly what that means while building DocQA, a small Claude-powered RAG app where you drop in a PDF, ask a question, and get an answer grounded in the document with clickable [n] citations that jump back to the source passage. The whole point of the app is a promise: it only answers from the document. But 'only answers from the document' is not something a unit test can check. There is no single right string for 'what does this contract say about termination?' -- there are a thousand valid phrasings and one very bad failure mode (making something up).

That gap -- between what I promised and what I could actually assert in code -- is where evals live. This post covers the general discipline of evaluating LLM apps, told through two Claude projects I actually built: DocQA (a cited document Q&A app) and SelfMind (a memory + RAG assistant). One has a grounding contract that is begging to be evaluated. The other has no eval suite at all, and I want to be honest about why that is a problem.

Why traditional testing breaks (and where it secretly still works)

Conventional tests assume determinism: same input, same output, exact comparison. LLM outputs violate every part of that. The same prompt yields different text run to run, correctness is a spectrum rather than one expected value, and a response can be fluent, confident, and completely wrong without anything crashing.

But here's the part most 'evals are hard' posts skip: a surprising amount of your surface is still deterministic, and you should test it like normal software before reaching for anything fancy. In DocQA, the retriever, the chunker, and the citation renderer are all ordinary, testable code. The Okapi BM25 core is hand-written and dependency-free -- same k1 and b constants every time, so identical input gives identical ranking. That is a plain unit test, no model required.

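// BM25 free parameters, fixed so identical input always gives identical ranking
const k1 = 1.5 // term-frequency saturation
const b = 0.75 // strength of document-length normalization
for (const term of q) {
  const f = tf.get(term) // frequency of the query term in this passage
  if (!f) continue
  const n = df.get(term) || 0 // number of passages containing the term
  const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5))
  // length-normalized denominator; Math.max guards against a zero average length
  const denom = f + k1 * (1 - b + (b * toks.length) / Math.max(avgdl, 1))
  score += idf * ((f * (k1 + 1)) / denom)
}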

The lesson: draw a hard line between the parts of your LLM app that are deterministic and the parts that aren't. Test the deterministic parts with boring unit tests. Save the expensive, fuzzy eval machinery for the one thing that genuinely can't be asserted -- the model's actual answer.
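
To make "a plain unit test, no model required" concrete, here is a minimal sketch of such a test. It is written in Python for brevity and is not DocQA's code: the function name bm25_rank, the whitespace tokenizer, and the three sample passages are assumptions made for illustration. Only the scoring formula and the k1 and b constants mirror the excerpt above.

```python
import math
from collections import Counter

K1 = 1.5   # same constants as the excerpt above
B = 0.75


def bm25_rank(query, docs):
    """Return passage indices sorted by descending BM25 score (ties broken by index)."""
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / max(n_docs, 1)

    # Document frequency: in how many passages each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            if not f:
                continue
            n = df.get(term, 0)
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            denom = f + K1 * (1 - B + B * len(toks) / max(avgdl, 1))
            score += idf * (f * (K1 + 1)) / denom
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: (-scores[i], i))


def test_ranking_is_deterministic():
    docs = [
        "either party may terminate with thirty days notice",
        "payment is due within sixty days of invoice",
        "termination for cause requires written notice",
    ]
    first = bm25_rank("terminate notice", docs)
    # Same input, same ranking: an exact comparison is legitimate here.
    assert first == bm25_rank("terminate notice", docs)
    # The passage containing both query terms should rank first.
    assert first[0] == 0
```

Because nothing in this path calls a model, the test can run on every commit like any other unit test.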

The layers: cheap-and-strict at the bottom, expensive at the top

Good eval suites are layered. You run the cheap checks constantly and the costly ones deliberately. What made this concrete for me was noticing that DocQA already ships several of these layers as production guardrails -- the same checks you'd write as evals are the ones keeping the live app honest.

Deterministic checks. Does the output obey the contract's shape? DocQA's grounding prompt tells the model to cite sources inline as [n]; a deterministic check confirms that every answer other than an abstention carries at least one citation and that no [n] points past the number of sources it was given.
Heuristic scoring. Cheap signals -- did the cited passage actually contain the answer's key terms? Rough, but good enough to flag a big regression in retrieval quality.
LLM-as-judge. Use a model to score an answer for faithfulness against the excerpts it was given. This layer catches the subtle 'fluent but unsupported' failure the cheap checks miss.
Human review. Periodically score a sample by hand to confirm your automated judge still agrees with an actual human. This is calibration, not throughput.
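
The heuristic layer can be as small as a term-overlap score. The sketch below is one possible version, not DocQA's implementation: the stopword list, the three-character minimum, and the name support_score are all assumptions chosen for illustration.

```python
import re

# Assumption: a tiny hand-picked stopword list; a real one would be longer.
STOPWORDS = {"the", "and", "with", "for", "that", "this", "from", "are", "was"}


def key_terms(text):
    """Lowercased alphanumeric tokens, minus stopwords and very short words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def support_score(answer, cited_passage):
    """Fraction of the answer's key terms that also appear in the cited passage."""
    # Strip [n] markers so citation numbers are not counted as terms.
    terms = key_terms(re.sub(r"\[\d+\]", "", answer))
    if not terms:
        return 0.0
    return len(terms & key_terms(cited_passage)) / len(terms)
```

A score near 1.0 does not prove the answer is faithful, and a low score does not prove it is wrong (paraphrases share few words). It is only a cheap signal to track across runs; a sudden drop is the cue to look closer.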

The trap is thinking you need the top layer first. You don't. Most of the failures I actually worried about in DocQA -- a dead citation chip, an answer with no source, a made-up source number -- are caught at the bottom two layers, for free, with no judge model and no dataset.
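
Those bottom-layer checks fit in a few lines. This is a sketch under stated assumptions: the function name citation_violations is hypothetical, and ABSTAIN_MARKER assumes the app's abstention message contains the phrase "could not find", which would need to match whatever wording the real prompt produces.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
ABSTAIN_MARKER = "could not find"  # assumed abstention phrasing


def citation_violations(answer, n_sources):
    """Return a list of contract violations; an empty list means the shape is fine."""
    cited = [int(n) for n in CITATION.findall(answer)]
    problems = []

    # A made-up source number: [n] outside 1..n_sources.
    out_of_range = sorted({n for n in cited if n < 1 or n > n_sources})
    if out_of_range:
        problems.append(f"out-of-range citations: {out_of_range}")

    # An answer with no source, unless the app is explicitly abstaining.
    is_abstention = ABSTAIN_MARKER in answer.lower()
    if not cited and not is_abstention:
        problems.append("answer has no citation")

    return problems
```

Running this over every logged answer costs nothing and needs no dataset, since it only inspects the output and the number of sources that were supplied.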

DocQA's grounding contract is the eval spec

Here is the thing I wish someone had told me earlier: your eval targets should fall out of the contract you already wrote in your system prompt. I didn't have to invent metrics for DocQA. The prompt already states the promise -- roughly, 'if the answer is not contained in the excerpts, say you could not find it; never use outside knowledge or guess.' That single sentence defines two directly evaluable behaviors: faithfulness (every claim traces to an excerpt) and abstention (when the doc doesn't contain the answer, the app says so instead of guessing).

And critically, I did not trust the model to hold up its end. The citation renderer has a hard-coded guard: the model can emit a citation number higher than the number of sources it was actually given, so the renderer leaves any [n] where n exceeds the source count as plain text, and it never becomes a dead, hallucinated citation chip. That guard is also, conveniently, a perfect eval assertion -- 'no answer contains an out-of-range citation' is a check I can run on every output deterministically, and it's exactly the class of failure for which an LLM-as-judge would be overkill.
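
A guard of this kind might look like the following. It is a Python sketch, not DocQA's renderer: the name render_citations and the anchor markup are invented for illustration, and treating [0] as out of range is an additional assumption (the text above only describes the too-high case).

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def render_citations(answer, n_sources):
    """Turn in-range [n] markers into links; leave out-of-range ones as plain text."""

    def replace(match):
        n = int(match.group(1))
        if 1 <= n <= n_sources:
            # Hypothetical markup standing in for a clickable citation chip.
            return f'<a class="cite" href="#source-{n}">[{n}]</a>'
        return match.group(0)  # unchanged plain text, never a dead chip

    return CITATION.sub(replace, answer)
```

The same condition doubles as an eval assertion: for any rendered output, no link should ever point at a source number above n_sources.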

An eval suite is just a regression test that admits it can't be exact. The teams who trust their LLM features aren't smarter -- they wrote down the contract, then refused to ship a change that quietly broke it.

One more DocQA decision has real eval consequences. Context strategy switches on document size: documents under FULL_MODE_CHAR_LIMIT (120,000 characters) go into a prompt-cached system block where the model sees the whole text; larger documents fall back to BM25 retrieval of the top 8 passages. Those are two different code paths with two different failure profiles. In full mode the model sees everything, so a wrong answer is a reasoning failure. In retrieval mode it only sees the top-8 chunks, so a wrong answer might be a retrieval miss -- the right passage never made it into context. If you only eval one path, you're blind to half your app; a useful dataset has to straddle the 120k-char line on purpose.
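
One way to keep a dataset honest about this is to compute which path each test document would take and fail the run if either path is missing. The sketch below assumes the switch is a strict "under the limit" comparison, as described above; the helper names context_mode and modes_covered are hypothetical, and only FULL_MODE_CHAR_LIMIT and its value come from the app.

```python
FULL_MODE_CHAR_LIMIT = 120_000


def context_mode(document_text):
    """Which code path a document takes: whole-document prompt or BM25 retrieval."""
    # Assumption: documents strictly under the limit go to full mode.
    return "full" if len(document_text) < FULL_MODE_CHAR_LIMIT else "retrieval"


def modes_covered(documents):
    """The set of code paths exercised by a collection of eval documents."""
    return {context_mode(text) for text in documents}


def assert_dataset_straddles_limit(documents):
    missing = {"full", "retrieval"} - modes_covered(documents)
    assert not missing, f"eval set never exercises: {sorted(missing)}"
```

Tagging each eval result with its mode also makes failures easier to read, since a drop that appears only in retrieval mode points at the chunker or retriever rather than the prompt.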

A minimal eval you can write this afternoon

You don't need a platform to start. The core of an eval is a small dataset of inputs with expectations and a function that scores the output. For a grounded Q&A app like DocQA, I'd start with three buckets, because they map straight onto the failure modes the design already anticipates:

Answerable questions -- the fact is in the document. Expectation: correct answer, at least one in-range citation, and the cited passage actually supports the claim.
Unanswerable questions -- deliberately ask something the document does not cover. Expectation: the app abstains ('I couldn't find it'), because abstention is a first-class promise here, not an error.
Adversarial and edge inputs -- the empty query, the answer that spans two pages, the query that shares words with an irrelevant passage. These are where retrieval mode and the chunker earn or lose your trust.

Twenty hand-picked examples across those three buckets, drawn from documents you actually care about, beat a thousand imagined ones. Score them on every change that touches a prompt, the model, the chunker, or the retriever -- and when a real failure slips through, add it to the dataset so the same bug can never ship twice. That last habit is the whole game: your golden set should grow out of your incident history, not your imagination.
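
As a rough starting point, the whole harness can be a dataclass and two functions. Everything named here is an assumption for illustration: Case, score_case, run_evals, the ABSTAIN_MARKER phrase, and answer_fn, which stands in for whatever function calls the real app. The sketch checks only answer content, citation range, and abstention; whether the cited passage truly supports the claim still needs a heuristic or a judge.

```python
import re
from dataclasses import dataclass

ABSTAIN_MARKER = "could not find"  # assumed abstention phrasing


@dataclass
class Case:
    bucket: str                   # "answerable", "unanswerable", or "edge"
    question: str
    must_contain: str = ""        # substring a correct answer should include
    should_abstain: bool = False  # True when the document lacks the answer


def score_case(case, answer, n_sources):
    """Pass/fail for one case, using only deterministic checks."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    abstained = ABSTAIN_MARKER in answer.lower()
    if case.should_abstain:
        return abstained
    in_range = bool(cited) and all(1 <= n <= n_sources for n in cited)
    has_fact = case.must_contain.lower() in answer.lower()
    return (not abstained) and in_range and has_fact


def run_evals(cases, answer_fn, n_sources):
    """Return {bucket: (total, passed)} so each bucket is reported separately."""
    results = {}
    for case in cases:
        passed = score_case(case, answer_fn(case.question), n_sources)
        total, ok = results.get(case.bucket, (0, 0))
        results[case.bucket] = (total + 1, ok + int(passed))
    return results
```

Reporting per bucket matters: a change that improves answerable questions while breaking abstention would look fine as a single average.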

The honest part: SelfMind has no evals, and that should scare you

I want to end on the project I did not evaluate, because a post that only shows the tidy case is doing the same 'ship on vibes' thing it warns against. SelfMind is my other Claude app -- a REPL assistant that wraps Claude in a persistent memory loop. After every exchange it makes a second Claude call that reads the conversation back and extracts durable facts, preferences, and corrections as schema-constrained JSON, dedupes them, and writes them to a SQLite store so a brand-new session still recalls what it learned. Here is that 'learning' step, wrapped so a failed reflection never breaks the chat:

python

try:
    resp = self.client.messages.create(
        model=self.config.memory_model,
        max_tokens=self.config.memory_max_tokens,
        thinking={"type": "disabled"},  # mechanical extraction
        system=_EXTRACTOR_SYSTEM,
        output_config={"format": {"type": "json_schema",
                                  "schema": _MEMORY_SCHEMA}},
        messages=[{"role": "user", "content": exchange}],
    )
except Exception as exc:  # never let a failed reflection break the chat
    print(f"[memory extraction skipped: {exc}]")
    return []

Now look at how many judgment calls are buried in that one call, none of which a unit test can check. Did the extractor store the right durable facts and skip the transient ones? Its own system prompt tells it not to store greetings, one-off task details, or the assistant's own suggestions -- but does it obey? Does the two-stage dedupe -- exact string match, then Jaccard token overlap at a 0.7 threshold -- collapse near-duplicate restatements without silently dropping a genuinely new fact? And does BM25 retrieval surface the right memory when the user's wording differs from how the fact was stored? BM25 is lexical: it matches shared words, so a memory stored as 'automobile' won't surface for a query about a 'car.'
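
The dedupe question is easy to probe once the logic is isolated. The following is a guess at the shape of a two-stage dedupe, not SelfMind's code: the function names, the case-insensitive "exact" match, the whitespace tokenizer, and treating a score of exactly 0.7 as a duplicate are all assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap of the two strings' lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(candidate, stored, threshold=0.7):
    """Stage 1: normalized exact match. Stage 2: Jaccard overlap at the threshold."""
    norm = candidate.strip().lower()
    for fact in stored:
        if fact.strip().lower() == norm:
            return True
        # Assumption: the boundary is inclusive (>=); the real code may use >.
        if jaccard(candidate, fact) >= threshold:
            return True
    return False


stored = ["user prefers dark mode in the editor"]
# Six of eight distinct tokens are shared, so overlap is 0.75 and this is dropped,
# even though "terminal" versus "editor" is arguably a new fact.
print(is_duplicate("user prefers dark mode in the terminal", stored))
```

That last example is the kind of case an eval set for the dedupe step would need: pairs that look alike lexically but differ in meaning, labeled by hand with the outcome you actually want.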

Every one of those is an eval waiting to be written -- extraction precision and recall, dedupe correctness at exactly the 0.7 boundary, retrieval hit-rate for BM25 versus a semantic (embedding-based) retriever swapped in. And I have none of them. To be completely straight: SelfMind is an early prototype (version 0.1.0), it ships with zero benchmarks, and the only numbers in the whole repo are configuration constants -- top_k=6 memories injected per turn, the 0.7 dedupe threshold, the BM25 k1=1.5 and b=0.75. I have not measured whether the extraction is actually good. I've read the code and believe it's reasonable, which is precisely the 'seems fine' state this entire discipline exists to replace.
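
For the extraction step, the scoring function itself is small; the hard part is the hand-labeled expectations. A minimal sketch, assuming facts can be compared as normalized strings (a crude stand-in, since real extractions would be paraphrased and need fuzzy or judge-based matching), with the function name and sample facts invented for illustration:

```python
def precision_recall(extracted, expected):
    """Precision and recall of extracted facts against a hand-labeled expected set."""
    extracted_set = {f.strip().lower() for f in extracted}
    expected_set = {f.strip().lower() for f in expected}
    true_pos = len(extracted_set & expected_set)
    # Convention chosen here: an empty side scores 1.0 rather than dividing by zero.
    precision = true_pos / len(extracted_set) if extracted_set else 1.0
    recall = true_pos / len(expected_set) if expected_set else 1.0
    return precision, recall


# Hypothetical labeled exchange: one durable fact stored correctly,
# one greeting stored by mistake, one durable fact missed.
extracted = ["User is vegetarian", "User said hello"]
expected = ["user is vegetarian", "user lives in Berlin"]
precision, recall = precision_recall(extracted, expected)
```

Low precision here would mean the extractor is storing transient noise; low recall would mean it is forgetting things the user expects it to remember. Those are different bugs with different fixes, which is why the two numbers are worth tracking separately.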

That's the real takeaway from building both. DocQA earned my trust because its promise was written down as a contract, then guarded in code and (in the parts that matter) checkable. SelfMind has a more ambitious promise -- that it learns you over time -- and no way to prove it keeps that promise. Evals don't make a language model deterministic, and they don't have to. They turn an opinion about quality into a number you can watch over time and defend in a review. Write the twenty examples. Straddle your code paths. Grow the set from real failures. You'll catch the regression before your users do -- and stop mistaking 'I read the code and it looks right' for 'I know it works.'
