Reasoning Models: Stop Pricing Them Per Token, Price Them Per Correct Answer

A reasoning model that costs 6x per token but turns a 60% success rate into 95% can be cheaper than the 'cheap' model once you count the retries, the human review, and the wrong answers that ship. Here's the mental model, a worked cost example, a real router, and the production gotchas the docs skip.

Dhileep Kumar · Jun 13, 2026 · 7 min read

Every debate about reasoning models -- o1, o3, DeepSeek-R1, Claude's extended thinking, Gemini's thinking mode -- gets framed the same lazy way: they cost more per token and they're slower, so use them sparingly. True, and useless. It's the same advice as 'hire senior engineers sparingly, they're expensive.' Expensive compared to what, measured against which outcome?

The framing that actually helps you make decisions is this: a reasoning model doesn't sell you tokens, it sells you a higher probability of a correct answer on the first try. The token bill is just the sticker price. The number that matters is cost per correct, accepted answer -- including everything the wrong answers cost you downstream. Once you price it that way, the 'expensive' model is sometimes the cheap one, and the cheap model is sometimes a slow-motion way to burn money on retries and cleanup.

A reasoning model is not a more expensive way to generate text. It's a machine that converts extra latency and tokens into fewer wrong answers. If wrong answers are cheap for you, skip it. If they're expensive, it might be the frugal choice.

The mental model: you're buying down an error rate

Here is the shift. A fast model gives you some success rate p on your task -- say it gets 6 out of 10 hard extraction jobs exactly right. A reasoning model spends test-time compute (that hidden chain of thought you pay for) to push p higher -- say 9.5 out of 10. You are not buying tokens. You are buying a reduction in your failure rate from 40% down to 5%.

Whether that trade is worth it depends entirely on one question the token-price debate never asks: what does a single wrong answer cost you? Not the token cost of generating it -- the total cost of it being wrong. That's the missing variable, and it swings the decision more than any per-token price ever will.

A wrong answer in a throwaway brainstorm costs ~nothing. You glance, discard, move on. Reasoning model = waste.
A wrong answer in a batch pipeline costs a retry, or a bad row that quietly poisons a downstream table nobody audits for weeks.
A wrong answer in an autonomous agent's plan costs every action taken on that wrong plan before something catches it -- sometimes a lot.
A wrong answer a human has to catch and redo costs 5-15 minutes of an expensive person's attention, every single time.

That last line is the one teams systematically forget. If a human reviews the output anyway, the wrong answers aren't free -- they're the most expensive tokens in your whole system, because they're denominated in salary, not API credits.

A worked example: when the 'expensive' model is cheaper

Let's make this concrete with illustrative numbers -- treat these as a rough model to plug your own figures into, not measured benchmarks. Say you're extracting structured data from 10,000 messy invoices. Every extraction that's wrong gets caught in review and costs a person about 8 minutes to fix, and you value that person's time at roughly USD 60/hour, so a wrong answer costs about USD 8 in human cleanup.

Now compare two options on the metric that matters, cost per accepted answer:

text

FAST MODEL
  Success rate ................. 80%  -> 2,000 wrong out of 10,000
  API cost per call ............ USD 0.002   (say 4,000 tokens)
  API cost total ............... 10,000 x 0.002   = USD 20
  Human cleanup ................ 2,000 x USD 8     = USD 16,000
  TOTAL ........................ USD 16,020

REASONING MODEL
  Success rate ................. 97%  -> 300 wrong out of 10,000
  API cost per call ............ USD 0.02    (~6x tokens, higher rate)
  API cost total ............... 10,000 x 0.02    = USD 200
  Human cleanup ................ 300 x USD 8       = USD 2,400
  TOTAL ........................ USD 2,600

The reasoning model's API bill is 10x higher -- USD 200 versus USD 20 -- and it is still more than six times cheaper overall. The entire token-price argument was arguing about the USD 180 difference while ignoring the USD 13,600 difference sitting right next to it. Change one assumption and the answer flips: if nobody reviews the output and wrong answers cost nothing, the fast model wins by USD 180 and the reasoning model is pure waste. Same two models, opposite verdict, and the deciding factor was never the token price.
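If you want to plug in your own figures, the arithmetic above fits in a few lines. This is a minimal sketch, assuming every wrong answer is caught in review and costs the same flat amount; `total_cost` and its parameter names are made up for illustration, not taken from any library.

```python
def total_cost(n_tasks, success_rate, api_cost_per_call, wrong_answer_cost):
    """API bill plus human cleanup for a batch, in USD.

    Assumes every wrong answer is caught and fixed at a flat cost.
    """
    wrong = round(n_tasks * (1 - success_rate))
    api_total = n_tasks * api_cost_per_call
    cleanup_total = wrong * wrong_answer_cost
    return api_total + cleanup_total


# Figures from the worked example: 8 minutes of review at USD 60/hour.
WRONG_ANSWER_COST = 8 * 60 / 60  # USD 8 per wrong answer

fast = total_cost(10_000, 0.80, 0.002, WRONG_ANSWER_COST)
reasoning = total_cost(10_000, 0.97, 0.02, WRONG_ANSWER_COST)

print(f"fast: USD {fast:,.0f}  reasoning: USD {reasoning:,.0f}")
```

Set `wrong_answer_cost` to 0 and the same function shows the flip: the fast model wins by the USD 180 API difference.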

So the first thing to compute is not 'which model is cheaper per token.' It's 'what does a wrong answer actually cost me, and how many fewer of them do I get.' If you can't estimate those two numbers, you're not ready to pick a model -- you're guessing.
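Once you have those two numbers, you can turn them into a single threshold: the wrong-answer cost at which the two models tie. A rough sketch, assuming the per-task expected cost is simply `api_cost + (1 - p) * wrong_cost` with no retries; the function name is hypothetical.

```python
def break_even_wrong_cost(p_fast, cost_fast, p_reasoning, cost_reasoning):
    """Cost of one wrong answer (USD) at which both models cost the same.

    Per-task expected cost is api_cost + (1 - p) * wrong_cost; setting the
    two models equal and solving for wrong_cost gives the expression below.
    """
    if p_reasoning <= p_fast:
        raise ValueError("reasoning model must have the higher success rate")
    return (cost_reasoning - cost_fast) / (p_reasoning - p_fast)


# Illustrative figures from the invoice example, not measured benchmarks.
threshold = break_even_wrong_cost(0.80, 0.002, 0.97, 0.02)
print(f"break-even wrong-answer cost: USD {threshold:.3f}")
```

With the invoice figures the threshold comes out at roughly eleven cents. Under these assumptions, if a wrong answer costs you more than that, the reasoning model pays for itself; if it costs less, it doesn't.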

The decision table

Reach for a reasoning model when the task has a verifiable right answer that takes several correct steps to reach, and being wrong is expensive. Skip it when either half of that is false -- when there's no single right answer, or when wrong is cheap.

text

USE A REASONING MODEL            USE A FAST MODEL
-----------------------------   -----------------------------
Multi-step math / logic         Lookups, classification
Complex code with a spec        Boilerplate, autocomplete
Agent PLANNING step             Agent's many EXECUTION steps
Wrong answer is costly          Wrong answer is cheap / caught
Batch job, latency is fine      Live chat, latency is the UX
Answer is checkable             Taste / tone / open-ended
Ambiguity needs untangling      Request is already unambiguous

The agent row is the one worth internalizing. The winning pattern is not 'use a reasoning model for the agent.' It's split-brain: a reasoning model writes the plan once, and a fast model executes the twenty dumb steps that plan generates. Route the whole agent through a reasoning model and you pay the thinking premium twenty times to click buttons that needed no thought.
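Here is roughly what split-brain looks like in code. It's a sketch of the shape, not a real agent: `run_agent`, `reasoning_plan`, and `fast_execute` are hypothetical stand-ins for your own model calls, and the counter only exists to show how often each model gets hit.

```python
from typing import Callable, List


def run_agent(goal: str,
              plan_with: Callable[[str], List[str]],
              execute_with: Callable[[str], str]) -> List[str]:
    """Plan once with the expensive model, execute every step with the cheap one."""
    steps = plan_with(goal)                         # one reasoning-model call
    return [execute_with(step) for step in steps]   # many fast-model calls


# Stand-ins for real model calls (hypothetical), counting how often each is used.
calls = {"reasoning": 0, "fast": 0}


def reasoning_plan(goal: str) -> List[str]:
    calls["reasoning"] += 1
    return [f"step {i} of {goal}" for i in range(1, 21)]


def fast_execute(step: str) -> str:
    calls["fast"] += 1
    return f"done: {step}"


results = run_agent("file the expense report", reasoning_plan, fast_execute)
print(calls)  # {'reasoning': 1, 'fast': 20}
```

Passing the two models in as separate callables keeps the premium confined to the one call that benefits from it.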

A router you can actually reason about

Everyone tells you to 'route by difficulty.' Almost nobody tells you the router itself is a trap, because the obvious way to build it defeats the entire point. Here's a minimal one, followed by why the naive version is worse than no router at all.

python

FAST, REASONING = "fast", "reasoning"   # placeholder model identifiers
LONG_THRESHOLD = 8_000                  # input-token cutoff; an assumed value, tune it

def choose_model(task):
    # Cheap, DETERMINISTIC signals first. No LLM call here.
    if task.type in ('lookup', 'classify', 'format'):
        return FAST
    if task.requires_multi_step or task.is_agent_plan:
        return REASONING
    if len(task.input_tokens) > LONG_THRESHOLD:
        return REASONING

    # Default to fast. Escalate on FAILURE, not on a hunch.
    return FAST

def run(task):
    # call() and passes_checks() are yours to supply: the model client
    # and a cheap verifier for this task type.
    model = choose_model(task)
    out = call(model, task)
    # The router's real value: a verifiable retry, not a vibe.
    if model == FAST and not passes_checks(out, task):
        out = call(REASONING, task)   # escalate once, then stop
    return out

The trap: teams reach for an LLM to classify difficulty -- 'ask a model whether this needs the smart model.' If that classifier is itself a reasoning model, you've paid the premium on 100% of traffic to decide whether to pay it again. If it's a fast model, it's exactly as bad at judging hardness as the fast model is at the task, so it mislabels the very cases you built the router to catch. Difficulty routing is cheapest and most reliable when it's a boring deterministic check on task metadata, and expensive when it's another model guessing.

The more robust move is the second half of that code: don't predict difficulty, detect failure. Run the fast model, check the output against something real -- does the JSON parse, does the code compile, does the sum reconcile, does the answer cite a source that exists -- and escalate only what fails. You pay the premium exactly on the hard cases, identified by ground truth instead of a guess. The catch is that this only works when you have a cheap verifier. If you can't cheaply check whether an answer is right, failure detection collapses back into difficulty prediction, and you're guessing again.
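For the invoice case, a cheap verifier might look like the sketch below. It assumes a hypothetical output schema (`invoice_id`, `line_items` with an `amount` each, and a `total`), and it is a simplified one-argument variant of the `passes_checks` used in the router; your own checks will depend on what your task can actually verify.

```python
import json

REQUIRED_FIELDS = ("invoice_id", "line_items", "total")  # hypothetical schema


def passes_checks(raw_output: str) -> bool:
    """Cheap deterministic checks: parses, has the fields, and the sum reconciles."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or any(f not in data for f in REQUIRED_FIELDS):
        return False
    try:
        line_sum = sum(float(item["amount"]) for item in data["line_items"])
        return abs(line_sum - float(data["total"])) < 0.005
    except (KeyError, TypeError, ValueError):
        return False


good = '{"invoice_id": "A-17", "line_items": [{"amount": 40.0}, {"amount": 2.5}], "total": 42.5}'
bad_sum = '{"invoice_id": "A-17", "line_items": [{"amount": 40.0}, {"amount": 2.5}], "total": 45.0}'

print(passes_checks(good), passes_checks(bad_sum), passes_checks("not json"))
# True False False
```

None of these checks prove the extraction is right. They only catch the failures that are cheap to catch, which is the point: anything that fails here is a safe candidate for escalation.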

Gotchas the docs won't put in bold

These are the failure modes that are reasoning-specific, the ones you tend to learn the expensive way rather than from the quickstart.

Uncapped thinking is a silent budget leak. Reasoning tokens are usually hidden rather than streamed, so a model chewing 20,000 hidden tokens on one adversarial prompt looks identical to one that's stuck. Set a reasoning-effort or max-thinking cap, and alarm on p99 thinking length -- an unbounded loop won't announce itself.
Your prompt engineering partially inverts. The 'let's think step by step' scaffolding you added for fast models is now redundant, and worse, over-instructing a reasoning model on HOW to think can fight its trained reasoning and lower quality. Give it the goal and constraints; stop narrating the method.
Never show or persist the raw chain of thought. It's verbose, occasionally unhinged, and can surface content the final answer correctly suppressed. Show users the answer. If you log the reasoning for debugging, treat that log as sensitive and access-controlled.
Time-to-first-token quietly breaks your UX assumptions. A pause that reads as 'thinking' at 5 seconds reads as 'frozen' at 20. If you must use one live, stream a status affordance during the silence, or you'll get abandonment that looks like a quality problem but is really a latency problem.
No fast fallback means a single slow dependency becomes a full outage. Reasoning endpoints are slower and rate-limit harder under load. If your only path is the reasoning model, a spike doesn't degrade you gracefully -- it stalls the whole queue. Keep a fast model wired in as the pressure-release valve.
Determinism drops. Longer reasoning traces mean more places to diverge, so the same prompt varies more run to run. If you cache, snapshot outputs for tests, or diff responses in CI, budget for that variance instead of being surprised by it.
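The p99 alarm from the first gotcha is easy to sketch. This assumes you already record a thinking-token count per request from whatever usage data your provider returns; the function names, the sample data, and the 8,000-token limit are all hypothetical.

```python
import math


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)   # 1-based nearest-rank index
    return ordered[rank - 1]


def thinking_budget_alarm(thinking_token_counts, p99_limit):
    """Return True when the p99 thinking length exceeds the limit."""
    return p99(thinking_token_counts) > p99_limit


# Hypothetical sample: 97 ordinary requests and 3 runaway ones.
sample = [1_200] * 97 + [20_000] * 3
print(p99(sample), thinking_budget_alarm(sample, p99_limit=8_000))
# 20000 True
```

An average over the same sample would sit under 2,000 tokens and look healthy, which is why the alarm belongs on the tail and not the mean.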

The one-sentence version

Stop asking 'is this model too expensive per token' and start asking 'what does a wrong answer cost me, and does this model give me enough fewer of them to pay for itself.' Reasoning models didn't make fast models obsolete -- they gave you a second gear for the exact moments when being right is worth waiting and paying for. The skill isn't picking the smart model or the cheap model. It's knowing, for the problem in front of you, which kind of wrong you can afford.
