Structured Outputs Are a Schema-Design Problem, Not a JSON Problem

Everyone frames structured outputs as "how do I stop the model from wrapping JSON in a markdown fence." That is the easy 5%. The hard part is designing a schema that stays honest when the model does not actually know the answer. Here is what changed my mind, using a real agent pipeline I work in.

Dhileep Kumar | Jun 10, 2026 | 7 min read

The usual pitch for structured outputs goes like this: ask a model for JSON and it hands you prose, a code fence, and a trailing apology, so you flip on a "JSON mode" flag or attach a schema and now your parser stops throwing. True, useful, and about 5% of the actual problem. That part is solved by the platform. You turn it on and move on.

The 95% that keeps me up at night is different and almost nobody writes about it: once the output is guaranteed to parse, every remaining bug moves up a layer into the schema itself. A response that is structurally perfect can still be semantically garbage, and now it slides silently into your database because nothing threw. Structured outputs do not remove failure. They relocate it from "did this parse" to "is my schema honest about what can go wrong." That relocation is the whole game, and it is a design problem, not a parsing problem.

I want to argue that point concretely, using a codebase I actually work inside: career-ops, an open-source job-search pipeline that runs on top of an AI coding CLI. It leans hard on structured outputs to let an LLM drive real automation, and the interesting decisions there are not "how do we get valid JSON" but "what shape do we force the model into so that a wrong answer is impossible to express."

The mental model: a schema is a contract about what the model is allowed to say

Think of a schema not as a data format but as the set of sentences the model is permitted to utter. Free text lets it say anything. A tight schema deletes whole categories of wrong answer before generation even happens. That reframing is the single most useful thing I can hand you, because it turns vague prompt-tuning ("please be careful") into a design activity ("make the bad answer unrepresentable").

A quick worked example. Suppose you extract a shipping status and your first schema is a plain string field. The model can return "shipped", "Shipped", "in transit", "on its way", "idk lol". All parse. All are a headache downstream. Now change one thing: make it an enum of exactly {pending, shipped, delivered, unknown}. You have not written a smarter prompt. You have made "on its way" impossible to emit. The reliability came from the shape, not the wording. That is the lever, and most teams reach for a longer prompt when they should be reaching for a narrower type.
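Here is roughly what that one change looks like, as a minimal sketch in plain JavaScript. It assumes a hand-rolled check rather than any particular schema library, and the names `SHIPPING_STATUSES`, `shippingStatusSchema`, and `parseShippingStatus` are made up for illustration.

```javascript
// Hypothetical names throughout; this is an illustration, not a library API.
const SHIPPING_STATUSES = Object.freeze(['pending', 'shipped', 'delivered', 'unknown']);

// A JSON Schema fragment of the kind you might attach to a structured-output request.
const shippingStatusSchema = {
  type: 'object',
  properties: {
    status: { type: 'string', enum: [...SHIPPING_STATUSES] },
  },
  required: ['status'],
  additionalProperties: false,
};

// Consumer-side check: accept only the four allowed values, reject everything else.
function parseShippingStatus(raw) {
  const value = JSON.parse(raw);
  const ok =
    value !== null &&
    typeof value === 'object' &&
    SHIPPING_STATUSES.includes(value.status);
  if (!ok) {
    throw new TypeError(`status must be one of: ${SHIPPING_STATUSES.join(', ')}`);
  }
  return value.status;
}
```

With constrained generation the model cannot emit "on its way" in the first place; the consumer-side check is there for the paths where the schema was not enforced upstream.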

A prompt asks the model to behave. A schema removes the model’s ability to misbehave in a specific dimension. In automation, only the second one survives contact with production.

The gotcha nobody warns you about: give the model a legal way to say "I don’t know"

Here is the failure mode that structured outputs quietly introduce. You define required fields, you constrain the types, you feel safe. Then you hand the model a document that simply does not contain the answer. The schema says the field is required and must be a date. The model, forbidden from omitting it, invents one. Constrained generation did its job perfectly and produced a confident fabrication, because you never gave "I could not find this" a place to live in the shape.

This is the non-obvious rule that experience beats into you: every required field is also a command to hallucinate under pressure. The tighter you make the schema, the more urgently you need an explicit escape hatch: a nullable field, an "unknown" enum member, an optional confidence flag, or a discriminated union where one branch is literally { found: false }. Without it, a strict schema does not prevent bad data. It launders bad data into a well-typed object that your validator waves through.
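To make the discriminated-union option concrete, here is a small sketch for the required-date case. The shape, the `reason` field, and the function name `parseDateExtraction` are assumptions chosen for illustration, and the date check only validates the `YYYY-MM-DD` format, not the calendar.

```javascript
// Hypothetical result shape for "extract the start date from this document":
//   { found: true,  date: 'YYYY-MM-DD' }
//   { found: false, reason: '...' }      <- the legal way to say "not in the document"
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function parseDateExtraction(value) {
  if (value === null || typeof value !== 'object') {
    throw new TypeError('expected an object');
  }
  if (value.found === false) {
    // The model declined to answer; keep that fact instead of inventing a date.
    return { found: false, reason: String(value.reason ?? 'not stated') };
  }
  if (value.found === true && typeof value.date === 'string' && ISO_DATE.test(value.date)) {
    return { found: true, date: value.date };
  }
  throw new TypeError('expected { found: true, date } or { found: false }');
}
```

The second branch only makes abstaining possible. The prompt and field descriptions still have to tell the model when to use it, and downstream code has to treat `found: false` as a first-class outcome rather than an error.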

career-ops has a clarifying instance of this. Its scanner classifies whether a job posting is still live, and headless Chromium sometimes trips a Cloudflare or hCaptcha wall instead of loading the page. The tiny challenge body looks like an empty posting. The naive shape has two outcomes, live or expired, and the challenge page falls through to "expired." The consequence is nasty: the mislabeled-dead job gets written to history and permanently filtered out, so a real opening vanishes forever. The fix was not a better heuristic. It was widening the output space to include a third value, so a page that is neither clearly live nor clearly dead has an honest place to land.

javascript

// liveness-core.mjs — an anti-bot wall must NOT be read as "expired".
// Two outcomes (live/expired) is the wrong shape here; add a third.
if (botChallenge) {
  return {
    result: 'uncertain',            // the honest escape hatch
    code: 'bot_challenge',
    reason: 'anti-bot challenge: ' + botChallenge.source,
  };
}
// Without this, the short challenge body falls through to
// insufficient_content -> expired, and a live job gets
// permanently blacklisted in scan-history.

Same lesson as the enum, one level up. The bug was an output space too small to represent the truth. The output was always well-formed. It was just well-formed and wrong. You fix that in the schema, not the prompt.
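A third value only helps if the caller treats it differently. The following is a hypothetical consumer, not career-ops' actual history code: the `state` layout and the `recordLiveness` name are assumptions, and the point is only that "uncertain" stays eligible for a re-check while "expired" is the sole outcome that filters a posting out.

```javascript
// Hypothetical consumer of the three-way result; the real scan-history logic may differ.
function recordLiveness(state, url, outcome) {
  switch (outcome.result) {
    case 'live':
      state.live.add(url);
      break;
    case 'expired':
      state.expired.add(url); // only a confirmed-dead posting gets filtered out
      break;
    case 'uncertain':
      state.retry.set(url, outcome.code); // keep it around for a later re-check
      break;
    default:
      // An unknown result is a bug in the producer, not a reason to guess.
      throw new Error(`unhandled liveness result: ${outcome.result}`);
  }
  return state;
}

const state = { live: new Set(), expired: new Set(), retry: new Map() };
recordLiveness(state, 'https://example.com/jobs/1', {
  result: 'uncertain',
  code: 'bot_challenge',
});
```

The `default` branch matters as much as the new case: if someone later adds a fourth result, the consumer fails loudly instead of silently filing it under one of the old ones.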

When NOT to reach for one strict schema

Structured outputs are not free, and the reflex to schema-everything creates its own class of problems. A few places I have learned to resist it, or to shape it differently than the tutorials suggest.

- One giant schema for a big extraction. In my experience, a model filling forty fields in a single call does worse per field than several focused calls that each fill a handful. The right move is decomposition, not a bigger object. Small schemas are more reliable schemas, and they fail in isolation instead of tanking the whole payload.
- Collapsing an uncertain judgment into a hard number. If a value is genuinely a qualitative assessment, forcing it into a numeric field launders opinion into false precision. Sometimes the honest schema is a tier or a label, kept deliberately separate from your quantitative fields.
- Emitting a value just because the shape has a slot. If the correct answer is "nothing correct fits here," a good schema lets you emit nothing rather than the least-wrong option. Filling the slot with garbage to satisfy the type is worse than an explicit gap.
- Trusting the model to write to a shared source of truth. Even validated output should not be handed the keys to a canonical store directly. Route it through a deterministic merge you control.

The second and third points both show up in career-ops in ways that changed how I think. Its offer evaluator produces a 1–5 fit score, but the separate legitimacy check for ghost or scam postings is deliberately kept as a qualitative tier — High Confidence, Proceed with Caution, Suspicious — and by design it does NOT fold into the numeric score. That is a schema decision with a spine: some things are judgments, and flattening them into a number to make the shape tidy would be dishonest. The output space keeps opinion and measurement in separate fields on purpose.
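A sketch of what keeping the two apart can look like in a validator. The tier labels come from the description above; the field names `fitScore` and `legitimacy` and the function `validateEvaluation` are hypothetical, not career-ops' real schema.

```javascript
// Hypothetical field names; only the three tier labels are taken from the text.
const LEGITIMACY_TIERS = Object.freeze([
  'High Confidence',
  'Proceed with Caution',
  'Suspicious',
]);

// Returns a list of problems; an empty list means the record is acceptable.
function validateEvaluation(value) {
  if (value === null || typeof value !== 'object') {
    return ['evaluation must be an object'];
  }
  const errors = [];
  // Measurement: a bounded number.
  if (!Number.isFinite(value.fitScore) || value.fitScore < 1 || value.fitScore > 5) {
    errors.push('fitScore must be a number from 1 to 5');
  }
  // Judgment: a label, never a number, and never blended into fitScore.
  if (!LEGITIMACY_TIERS.includes(value.legitimacy)) {
    errors.push(`legitimacy must be one of: ${LEGITIMACY_TIERS.join(', ')}`);
  }
  return errors;
}
```

Because the tier is a closed set of labels, a model that tries to return `legitimacy: 0.3` fails validation instead of quietly turning a judgment into a number.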

The third point is my favorite because it is counterintuitive. The tool’s resume generator normalizes fancy Unicode to plain ASCII so applicant-tracking parsers do not choke on em-dashes and smart quotes, and it spells out EUR and GBP. But it flatly refuses to touch the ¥ glyph, and the code says why: ¥ means both Japanese Yen and Chinese Yuan, so spelling out either code would be wrong for half the users. The honest move is to emit the raw glyph and no code rather than a confidently-wrong one. A slot existed; the right value for it was "leave it alone." Most schemas never give you permission to do that, and they should.
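The same idea in a few lines, as an illustration rather than the tool's actual code. The mapping table, the function name `spellOutCurrency`, and the choice to follow each code with a space are all assumptions.

```javascript
// Hypothetical mapping: only glyphs with exactly one meaning get a code.
const UNAMBIGUOUS_CURRENCY = new Map([
  ['€', 'EUR'],
  ['£', 'GBP'],
  // '¥' is deliberately absent: it can mean JPY or CNY.
]);

function spellOutCurrency(text) {
  return text.replace(/[€£¥]/g, (glyph) =>
    UNAMBIGUOUS_CURRENCY.has(glyph)
      ? `${UNAMBIGUOUS_CURRENCY.get(glyph)} ` // e.g. "€50k" becomes "EUR 50k"
      : glyph // no safe answer, so leave the original glyph untouched
  );
}
```

The interesting line is the fallback. Returning the glyph unchanged is the explicit "no correct value fits" branch, written into the code instead of left to chance.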

The production rule: validate the output, but distrust the writer

The fourth bullet is the one that scales to agents, and career-ops enforces it in a way I now copy elsewhere. The agent is explicitly forbidden from adding rows to the canonical applications tracker. Even though its output is structured. Even though it validates. Instead the agent writes a plain 9-column TSV line into a staging directory, and a separate deterministic script owns the actual merge: deduping by company and role, reordering the columns from the layout the agent emits to the one the tracker uses, and normalizing links. The AI proposes a well-shaped record. A boring, testable, non-AI function decides whether and how it lands.

text

num  date  company  role  status  score/5  pdf  report-link  note
# Agent writes ONE line in this fixed shape to a staging folder.
# It is NEVER allowed to edit the canonical tracker directly.
# merge-tracker.mjs owns dedup + column-swap + link-normalization.

That is the pattern worth internalizing: schema-constrained output makes the model’s reply safe to read, not safe to act on. Those are different guarantees. The schema is your parse-time contract; a deterministic merge or validator is your write-time contract. Skipping the second because the first passed is how two parallel agent workers race and corrupt a shared file with perfectly valid rows. That is the difference between "the JSON was fine" and "the system stayed correct."
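A write-time contract can be as plain as the sketch below. It is not the real merge-tracker.mjs: the column names are paraphrased from the staged-line layout, `parseStagedLine` and `mergeStaged` are hypothetical, and the column reordering and link normalization the real script performs are left out because their details are not given here.

```javascript
// Hypothetical sketch of a deterministic merge step; not the actual merge-tracker.mjs.
const COLUMNS = ['num', 'date', 'company', 'role', 'status', 'score', 'pdf', 'report', 'note'];

// Parse one staged TSV line and refuse anything that is not exactly nine columns.
function parseStagedLine(line) {
  const cells = line.split('\t');
  if (cells.length !== COLUMNS.length) {
    throw new Error(`expected ${COLUMNS.length} columns, got ${cells.length}`);
  }
  return Object.fromEntries(COLUMNS.map((name, i) => [name, cells[i].trim()]));
}

// Two rows are duplicates when company and role match, ignoring case.
const dedupKey = (row) =>
  JSON.stringify([row.company.toLowerCase(), row.role.toLowerCase()]);

// Pure function: returns a new tracker array and never mutates its input.
function mergeStaged(tracker, stagedLines) {
  const seen = new Set(tracker.map(dedupKey));
  const merged = [...tracker];
  for (const line of stagedLines) {
    const row = parseStagedLine(line);
    const key = dedupKey(row);
    if (seen.has(key)) continue; // already tracked, drop the duplicate
    seen.add(key);
    merged.push(row);
  }
  return merged;
}
```

Because `mergeStaged` is pure, it is easy to unit test, and running it from a single process over the whole staging folder is what keeps parallel workers from writing to the tracker at the same time. The sketch itself does no locking.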

A short decision guide

- Reach for JSON mode when you just need the reply to parse and you will validate the meaning yourself.
- Reach for a schema / typed model when the fields and types matter and you want wrong shapes to be unrepresentable, and add an explicit unknown/null branch every time.
- Reach for tool / function calling when the model should choose an action with typed arguments rather than describe one. This is the agent case.
- Keep qualitative judgments as tiers or labels, separate from numeric fields, instead of flattening them for neatness.
- Never let even validated output write straight to a canonical store. Stage it and let a deterministic step own the merge.

None of this is exotic. It is the same idea applied at four altitudes: an enum instead of a string, an "uncertain" result instead of a forced live/dead, a tier kept out of a score, a staging file instead of a direct write. Every one of them is the model being handed a shape narrow enough that the wrong answer cannot be said, plus an honest slot for "I don’t know" so it is not cornered into inventing one.

So yes, turn on the JSON flag; strip the markdown fence; attach the Pydantic or Zod model. That gets you output that parses. The reliability you actually want comes from the next question, the one the tutorials skip: what is my schema quietly forcing the model to lie about, and where did I forget to give it permission to say "I don’t know"? Answer that, and structured outputs stop being a parsing trick and start being what they are — the contract that lets a language model become a component you can compose instead of a text generator you have to babysit.