Reasoning models are a class of large language models that allocate substantially more compute to each request than standard LLMs, in exchange for higher accuracy on tasks that benefit from explicit step-by-step deliberation. Instead of producing a final answer in a single pass, they generate a long internal chain of thought before committing to a response. The chain itself is often hidden from the end user, but it is not free. Every reasoning token is billed at the model's output rate, which is why a single request to a reasoning model can cost twenty to two hundred times more than the same request to a standard model from the same provider.
The category emerged into mainstream use in late 2024 with OpenAI's o-series and consolidated through 2025 as Anthropic, Google, DeepSeek, Alibaba, and others shipped reasoning variants of their flagship models. The shift matters because it is the first broad commercial use of test-time compute as a deliberate scaling axis, distinct from the pretraining-compute axis that produced GPT-4, Claude 3.5, Gemini 1.5, and Llama 3. For teams building AI products, the cost and latency profile of reasoning models is different enough that they need their own selection logic, observability, and monitoring.
This post extends the guide to AI inference and the training versus inference comparison into the architectural and economic mechanics of test-time compute.
What test-time compute means
Test-time compute is compute allocated to a single request while the model is serving it, rather than compute amortized into the model weights during pretraining. Every token a language model generates costs one forward pass through the network, so a model that produces a thousand tokens has spent a thousand forward passes worth of compute on that one request. A reasoning model spends most of those forward passes on internal deliberation (planning, self-correction, verification) before committing to a final answer the user actually sees. The total compute for a single request therefore scales with the length of the reasoning trace, and the relevant variable for accuracy becomes how long the model is allowed to think, not just how large it is.
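To make that scaling concrete, here is a back-of-envelope sketch using the common approximation that one forward pass through an N-parameter dense transformer costs roughly 2N FLOPs per generated token. The 100B-parameter model and the token counts are illustrative assumptions, not figures for any particular model.

```python
# Back-of-envelope: per-request compute scales linearly with tokens generated.
# Uses the common ~2*N FLOPs-per-token approximation for an N-parameter dense
# transformer; ignores attention and KV-cache overhead. All figures illustrative.

def request_flops(params: float, tokens_generated: int) -> float:
    return 2 * params * tokens_generated

N = 100e9  # hypothetical 100B-parameter model

direct = request_flops(N, 500)             # single-pass, 500-token answer
reasoned = request_flops(N, 10_000 + 500)  # 10k-token trace plus the answer

print(f"direct:   {direct:.2e} FLOPs")     # 1.00e+14
print(f"reasoned: {reasoned:.2e} FLOPs")   # 2.10e+15, ~21x the compute
```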
The distinction matters because train-time and test-time compute are independent axes. Train-time compute is paid once when the model is built and is then fixed for the model's lifetime. Test-time compute is paid per request and can be tuned at runtime, either implicitly by choosing a reasoning-tier model or explicitly by setting a reasoning-effort or thinking-budget parameter. This is the first time in the modern LLM era that an inference-side knob has produced quality gains comparable to scaling pretraining, which is why the category exists as its own product tier rather than as a deployment optimization.
The shift from pretraining scaling to test-time scaling
For most of the past five years, the dominant way to make a language model better was to make training larger. More parameters, more tokens, more compute, all spent during pretraining, produced a model with broader capabilities and stronger zero-shot performance. The scaling laws documented by Kaplan and colleagues at OpenAI, and refined by the Chinchilla paper at DeepMind, treated training compute as the relevant axis. Once a model was trained, every inference request was a single forward pass, and the only way to extract more accuracy was to switch to a larger model, fine-tune for a specific domain, or engineer a better prompt.
Reasoning models break this pattern by treating test-time compute as a deliberate scaling axis. A reasoning model with a hundred billion parameters running for ten thousand tokens of internal deliberation can outperform a substantially larger model that answers in a single pass, on tasks where the additional reasoning translates into genuine progress toward a correct answer. The pretraining curve is not replaced; it is augmented by a second curve along which accuracy improves with reasoning-trace length until diminishing returns set in.
OpenAI's documentation around the o1 launch made the trade-off explicit: the model was specifically optimized to convert additional inference compute into improved performance on math, coding, and scientific reasoning benchmarks. Subsequent model families from other labs have followed the same pattern, with the size of the reasoning trace either configurable by the developer (Anthropic's extended thinking budget, OpenAI's reasoning_effort parameter) or implicit in the model variant chosen.
How reasoning models actually work
The mechanics behind reasoning models are not a separate architecture. They are still transformers with the same forward-pass machinery as any other LLM. What changes is how they are trained and what their generation procedure produces.
Reasoning capability is induced primarily through reinforcement learning on verifiable rewards. After standard pretraining, the model is fine-tuned on a corpus of problems whose solutions can be checked automatically: math problems with closed-form answers, code that compiles and passes tests, structured tasks with deterministic outputs. The reward is assigned for correct final answers, and the policy-gradient updates push the model toward generating longer, more deliberative chains of reasoning when those chains lead to correct outcomes. Over many training iterations, the model learns to spend output tokens on planning, self-correction, backtracking, and verification, behaviors that resemble explicit chain-of-thought prompting but emerge from learned policy rather than prompt engineering.
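In schematic form, that training loop looks something like the sketch below. This is a simplified illustration of the general recipe, not any lab's actual pipeline: sample_completion, check_answer, and policy_gradient_update are hypothetical stand-ins for the real components.

```python
# Schematic RL-on-verifiable-rewards loop. All helper functions are
# hypothetical stand-ins, not a real training API.

for problem in verifiable_problems:  # math, unit-tested code, etc.
    completion = sample_completion(model, problem.prompt)
    # The completion contains a reasoning trace followed by a final answer;
    # only the final answer is checked against the verifier.
    reward = 1.0 if check_answer(completion.final_answer, problem) else 0.0
    # Reinforce whole trajectories that ended in a correct answer,
    # including the long deliberation that preceded it.
    policy_gradient_update(model, completion.tokens, reward)
```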
The DeepSeek R1 paper, published in early 2025, was the first widely circulated public account of this training procedure outside OpenAI. It described an RL-only stage that taught the base model to reason before any supervised fine-tuning on human-curated reasoning examples, demonstrating that the capability could be elicited without proprietary reward-modeling pipelines. The paper also made clear that the reasoning trace is not a structured intermediate representation. It is just text, generated autoregressively, and the model has learned that producing more of it before answering improves expected reward.
At inference time, the model emits this reasoning trace as part of its normal token-generation loop and then produces a final answer. Whether the reasoning is shown to the end user is a product choice by the API provider. OpenAI hides the raw reasoning behind a summary, Anthropic exposes the thinking trace in API responses behind an opt-in flag, and DeepSeek and several open-weight providers expose it directly. In all cases, the reasoning tokens are produced and billed regardless of visibility.
The token economics of reasoning
The economic consequence of test-time compute is that output token usage explodes. A standard LLM answering a moderate question typically produces a few hundred output tokens. The same question answered by a reasoning model can produce three thousand to fifty thousand reasoning tokens before the final response, depending on problem difficulty and the configured reasoning effort.
This matters because output tokens are the most expensive component of LLM API pricing. Public list rates for reasoning model output tokens are typically two to four times higher than the equivalent non-reasoning model from the same provider, and the volume of output tokens per request is ten to fifty times higher. Multiplying the two factors gives a per-request cost on the order of twenty to two hundred times higher than the same query against a standard model. The ratio depends heavily on problem difficulty and configured reasoning budget; trivial questions sent to a reasoning model can produce reasoning traces that are wildly out of proportion to the value of the final answer.
A worked example illustrates the scale. A reasoning model priced at $15 per million output tokens that produces a 10,000-token reasoning trace plus a 500-token answer costs $0.1575 in output tokens for that single request. The same request against a non-reasoning model priced at $5 per million output tokens, producing the same 500-token answer, costs $0.0025. The reasoning request is sixty-three times more expensive, and the user sees a final answer of the same length. Whether the additional cost is justified depends entirely on whether the reasoning improved the answer in a way the user can verify and is willing to pay for.
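The same arithmetic in code, using only the figures from the example above:

```python
# Cost comparison from the worked example. Prices and token counts are the
# illustrative figures above, not any provider's actual rates.

def output_cost(tokens: int, usd_per_million_tokens: float) -> float:
    return tokens / 1_000_000 * usd_per_million_tokens

reasoning = output_cost(10_000 + 500, 15.0)  # trace + answer at $15/M
standard = output_cost(500, 5.0)             # answer only at $5/M

print(f"reasoning model: ${reasoning:.4f}")   # $0.1575
print(f"standard model:  ${standard:.4f}")    # $0.0025
print(f"ratio: {reasoning / standard:.0f}x")  # 63x
```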
For teams operating production AI features, this cost profile breaks the assumption that "use the best model" is a safe default. Routing every request to a reasoning model regardless of difficulty produces inference bills that are difficult to predict and disproportionate to the typical value of each response. Our breakdown of the hidden costs of LLM APIs covers the related effects of cache invalidation, output verbosity, and provider rate-limit interaction; reasoning token usage compounds with each of these.
The latency profile
Reasoning models change the latency budget of an AI request more than any other architectural choice. A standard LLM produces its first token in a few hundred milliseconds and streams the rest at fifty to two hundred tokens per second, so a typical response completes in a few seconds. A reasoning model produces tokens at the same per-token rate but generates ten to a hundred times more of them, with the complication that nothing useful is visible to the user until the reasoning phase completes and the final answer begins.
In product terms, time-to-first-token loses its meaning as a UX metric for reasoning workflows. The relevant figure becomes time-to-final-answer, which can be thirty seconds to several minutes for hard problems. APIs that expose the reasoning trace make it possible to stream the model's thinking as user-visible feedback, which converts the latency from dead time into a partial product surface. APIs that hide the reasoning leave the request as a long-running operation that user interfaces must accommodate explicitly, often through asynchronous patterns that a typical chat UI does not natively support.
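A crude latency model makes the gap concrete. The sketch below assumes a constant decode rate and ignores time-to-first-token and network overhead, so treat the outputs as order-of-magnitude estimates.

```python
# Order-of-magnitude latency model: time-to-final-answer is total tokens
# generated divided by the decode rate. Ignores TTFT and network overhead;
# the decode rate is an illustrative assumption.

def time_to_final_answer(total_tokens: int, tokens_per_second: float) -> float:
    return total_tokens / tokens_per_second

print(time_to_final_answer(500, 100.0))     # standard model: ~5 seconds
print(time_to_final_answer(10_500, 100.0))  # reasoning model: ~105 seconds
```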
This latency profile makes reasoning models unsuitable for interactive workflows where a sub-second response is required. They are best suited for tasks where the user implicitly expects deliberation, such as code generation, research synthesis, document analysis, and structured problem solving. For request paths where latency dominates the user experience, routing logic should send most requests to a standard model and reserve reasoning calls for cases that meet a difficulty threshold.
When the cost is justified
The decision of whether to use a reasoning model for a given request reduces to whether the additional reasoning materially improves the output along an axis that matters to the user. Three categories of task consistently benefit from reasoning models:
- Mathematical and quantitative reasoning. Problems with verifiable answers (arithmetic, symbolic manipulation, applied calculation, proof construction) show large accuracy gaps between reasoning and non-reasoning models. The benchmark results reported on AIME, MATH, and GPQA Diamond demonstrate this consistently across providers.
- Code generation under nontrivial constraints. Tasks that require planning, multi-step refactoring, or correctness against test suites benefit measurably from reasoning. Single-file completion or boilerplate generation typically does not.
- Structured analysis and decomposition. Long-context tasks that require the model to integrate evidence across a document, plan a multi-step procedure, or evaluate trade-offs against multiple criteria benefit from explicit deliberation. Conversational chat, summarization, and retrieval-style question answering generally do not.
Tasks that are unlikely to benefit include extraction from short text, classification, simple question answering, formatting transformations, and any task where the answer is essentially looked up rather than reasoned about. Routing these to a reasoning model produces no quality lift and a substantial cost increase.
The practical pattern that has emerged in production deployments is two-tier routing: lightweight requests go to a standard model, with classification logic (rule-based, regex, or a small classifier model) determining whether a request belongs in the reasoning tier. The classifier's own inference cost should be a small fraction of either tier. Our model selection guide covers the broader logic, and the same routing pattern extends naturally to the standard-versus-reasoning split.
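A minimal version of that router might look like the sketch below. The trigger keywords, length threshold, and model identifiers are placeholder assumptions; a production router would typically replace the heuristic with a small tuned classifier.

```python
# Minimal two-tier routing sketch. Heuristic, threshold, and model names are
# placeholders; production routers usually use a small tuned classifier.

REASONING_TRIGGERS = ("prove", "refactor", "debug", "derive", "plan")

def needs_reasoning(prompt: str) -> bool:
    # Crude rule-based gate: long prompts or trigger keywords go to the
    # reasoning tier; everything else takes the cheap path.
    lowered = prompt.lower()
    return len(prompt) > 2_000 or any(t in lowered for t in REASONING_TRIGGERS)

def route(prompt: str) -> str:
    return "reasoning-model" if needs_reasoning(prompt) else "standard-model"
```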
Production observability for reasoning workloads
Reasoning model deployments require monitoring that standard LLM deployments often skip. The two most important new metrics are reasoning-token volume per request and reasoning-effort distribution.
Reasoning-token volume is the cleanest leading indicator of cost. Monitoring this per endpoint, per user cohort, and per request type reveals when reasoning effort is mismatched to task difficulty. Spikes in average reasoning length usually indicate one of three conditions: ambiguous prompts that cause the model to wander, requests routed inappropriately to the reasoning tier, or upstream changes (a new system prompt, a tool definition that confuses the model) that increase deliberation length without improving outputs.
Reasoning-effort distribution refers to how requests are spread across the configurable thinking budgets exposed by some providers. Anthropic exposes a budget_tokens parameter on extended thinking; OpenAI exposes reasoning.effort with low, medium, and high levels. Tracking the distribution of these settings in production and correlating with downstream quality metrics (user thumbs-up rates, regenerate rates, downstream task success) is the only reliable way to converge on the right setting for a given workload.
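Both knobs are set per request. The sketch below follows the parameter shapes the two providers document at the time of writing; the model identifiers are placeholders, and current model names and limits should be taken from each provider's docs.

```python
# Setting reasoning effort explicitly. Parameter shapes follow the providers'
# documented APIs at the time of writing; model names are placeholders.

from anthropic import Anthropic
from openai import OpenAI

# Anthropic: extended thinking with an explicit token budget. max_tokens
# must exceed the thinking budget, since it also covers the final answer.
anthropic_resp = Anthropic().messages.create(
    model="<anthropic-reasoning-model>",
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Plan the migration step by step."}],
)

# OpenAI: discrete effort levels on the Responses API.
openai_resp = OpenAI().responses.create(
    model="<openai-reasoning-model>",
    reasoning={"effort": "medium"},  # low | medium | high
    input="Plan the migration step by step.",
)
```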
Cost monitoring should aggregate input tokens, output tokens, and reasoning tokens separately. Reasoning tokens are output tokens at the API level but represent a different operational cost driver, and lumping them into a single output-token line obscures the dynamics that matter for cost optimization. Our post on LLM benchmarks covers how published benchmark scores relate to production performance, which is particularly relevant for reasoning models since their reported numbers usually assume the highest reasoning-effort setting.
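As a concrete example of that separate accounting, OpenAI's Responses API reports reasoning tokens inside the usage object, which makes the split straightforward to log. The record_metric helper below is a placeholder for whatever metrics client is in use.

```python
# Log input, visible-output, and reasoning tokens as three separate metrics.
# Usage fields follow OpenAI's Responses API; `record_metric` is a placeholder
# for your metrics client.

def record_token_usage(resp, record_metric) -> None:
    usage = resp.usage
    reasoning = usage.output_tokens_details.reasoning_tokens
    record_metric("llm.input_tokens", usage.input_tokens)
    record_metric("llm.visible_output_tokens", usage.output_tokens - reasoning)
    record_metric("llm.reasoning_tokens", reasoning)  # the primary cost driver
```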
Summary table
| Dimension | Standard LLM | Reasoning Model |
|---|---|---|
| Compute axis | Pretraining scale | Pretraining + test-time |
| Output tokens per request | Hundreds | Thousands to tens of thousands |
| Cost per request (relative) | 1x | 20x to 200x |
| Latency profile | Seconds | Tens of seconds to minutes |
| Useful UX metric | Time-to-first-token | Time-to-final-answer |
| Best for | Conversation, extraction, classification, formatting | Math, code, multi-step analysis, planning |
| Configurable effort | No | Yes (e.g. budget_tokens, reasoning.effort) |
| Reasoning visibility | N/A | Provider-dependent (hidden, summary, or full) |
| Training procedure | Pretraining + RLHF | Pretraining + RL on verifiable rewards |
What this means in practice
Reasoning models are not a drop-in upgrade for existing AI workloads. They are a different product class with different cost, latency, and observability requirements, and routing all traffic to a reasoning model is the most expensive available answer. The teams that get the most leverage out of test-time compute are the ones that treat it as a tool for a specific problem class, route deliberately, monitor reasoning-token spend as a first-class metric, and tune the reasoning budget against measurable quality outcomes.
For product teams building AI features, the practical question is rarely "should we use a reasoning model" in the abstract. It is "which of our requests benefit from deliberation, and how do we route those to a reasoning model without sending the cheap ones." Answering that question requires understanding the underlying economics, the latency profile, and the architectural patterns that make selective reasoning cost-effective.
To benchmark reasoning and standard models against your own workload, the Inferbase playground supports side-by-side comparison across providers, and the unified inference API exposes reasoning-effort parameters consistently across the major reasoning model families.