Most LLM features are built for a human reader. The model writes a paragraph, a person reads it, and a small imperfection does not matter. Structured output is the opposite case. Here the model writes something a program will parse, store, or pass to the next step, and a single malformed character breaks the whole pipeline. Extraction jobs, classification labels, tool arguments, form filling, and agent state all share this requirement. They need a JSON object of a known shape, every time, not prose.
The naive approach is to ask for it. You append "respond only with JSON" to the prompt and hope. On a strong model this works often enough to feel solved, and then it fails in production, usually in one of a few predictable ways. The model wraps the JSON in a Markdown fence, adds a sentence of explanation before it, leaves a string value unquoted, or runs into the token limit mid-object and never writes the closing brace. The result looks correct to a human and fails the moment a program tries to load it. This guide covers the mechanisms that turn "usually JSON" into "always valid JSON of the right shape," how the major providers expose them, and the failure modes that survive even when you do everything right.
Three levels of guarantee
There is a ladder here, and the rungs are easy to confuse because the vocabulary overlaps. Each level removes a specific class of failure and costs a little more setup than the one below it.
Prompt only is a request that carries an instruction but no machine-level constraint. The model is free to emit any tokens it likes, so the guarantee is zero. It can still be the right tool for a quick script against a frontier model, but it has no place in a path that must not break.
JSON mode is the first real constraint. You set a request flag, commonly response_format with a type of json_object, and the provider guarantees the output is syntactically valid JSON. No stray prose, no code fence, no unquoted values. What it does not guarantee is the shape. The model can return {"answer": "yes"} when you expected {"label": "approved", "confidence": 0.9}, and both are valid JSON. JSON mode removes parse failures but does nothing about contract failures.
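The gap between a parse failure and a contract failure is easy to see in a few lines. The sketch below is illustrative only: the two response strings are the ones from the paragraph above, and REQUIRED_KEYS and missing_keys are hypothetical names invented for this example.

```python
import json

# Two responses JSON mode could legitimately return for the same prompt.
expected_shape = '{"label": "approved", "confidence": 0.9}'
drifted_shape = '{"answer": "yes"}'

# Hypothetical contract: the keys downstream code actually reads.
REQUIRED_KEYS = {"label", "confidence"}


def missing_keys(raw: str) -> set:
    """Parse the text as JSON and return the required keys it lacks."""
    obj = json.loads(raw)  # succeeds for both strings: both are valid JSON
    return REQUIRED_KEYS - set(obj)


print(missing_keys(expected_shape))  # nothing missing
print(missing_keys(drifted_shape))   # both required keys are missing
```

Both strings pass json.loads, so a pipeline that only checks for parse errors would accept the second one and fail later, at the point where it reads a field that is not there.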
Structured outputs close that gap. You supply a JSON schema, the keys, types, enums, and required fields you actually depend on, and the provider guarantees the output is valid JSON that conforms to the schema. This is the level that lets downstream code treat the response as a typed object rather than a hopeful guess. It is also the level where the mechanism becomes interesting, because the guarantee is not a check run after the fact. It is enforced while the model is still choosing tokens.
How structured outputs actually work
A language model generates one token at a time by sampling from a probability distribution over its entire vocabulary. Left unconstrained, it can sample a token that makes the output invalid: a letter where the schema requires a digit, or a comma where it requires a closing brace. Constrained decoding removes those tokens from contention before the sample is drawn.
The provider compiles your schema into a state machine, effectively a grammar that describes every valid continuation of the output so far. At each step, it intersects the model's distribution with the set of tokens the grammar allows from the current state and masks out every other token by driving its probability to zero. The model then samples only from the legal tokens. Because an illegal token is never available, the output cannot become invalid, and as long as generation runs to completion the result satisfies the schema by construction.
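A toy version of that loop may make the mechanism concrete. This is a minimal sketch, not how any provider implements it: the eight-token vocabulary, the hand-written GRAMMAR table (which accepts only an object with a single boolean field named ok), and the fake_logits stand-in for a model are all assumptions made up for illustration.

```python
import json
import math
import random

VOCAB = ["{", "}", '"ok"', ":", "true", "false", ",", "Sure!"]

# Toy grammar for {"ok": <bool>}: state -> {allowed token: next state}.
GRAMMAR = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
FINAL_STATE = 5


def fake_logits(rng: random.Random) -> dict:
    """Stand-in for a model: random scores, heavily biased toward chatty prose."""
    logits = {tok: rng.uniform(-1.0, 1.0) for tok in VOCAB}
    logits["Sure!"] += 5.0
    return logits


def constrained_sample(rng: random.Random) -> str:
    state, out = 0, []
    while state != FINAL_STATE:
        allowed = GRAMMAR[state]
        logits = fake_logits(rng)
        # Masking step: illegal tokens are dropped before sampling, so the
        # model's preference for "Sure!" can never be acted on.
        tokens = list(allowed)
        weights = [math.exp(logits[t]) for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        out.append(token)
        state = allowed[token]
    return "".join(out)


text = constrained_sample(random.Random(0))
print(text, json.loads(text))
```

The model still decides between true and false, which is the content. The grammar decides everything else, which is why the result parses even though the unconstrained scores favor prose at every step.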
This is why structured outputs are categorically stronger than asking nicely. A prompt-only request relies on the model having learned to produce well-formed JSON, while constrained decoding makes well-formed JSON the only thing the model is physically able to produce. The same machinery powers tool and function calling, where the schema is the tool's parameter definition, and it is the reason a forced tool call returns clean arguments even from a model that fumbles raw JSON. The relationship runs the other way too: tool calling is, mechanically, structured output with a function signature standing in for the schema.
How the providers differ
The capability is widely available, but the field names and the exact guarantees vary, so portable code has to account for the differences. The shapes below are the durable patterns; check each provider's current reference for precise field names.
- OpenAI exposes two levels directly. A response_format of json_object gives plain JSON mode, and a response_format of json_schema with a strict flag gives schema-enforced structured outputs through constrained decoding. The same strict enforcement is available on function definitions.
- Google Gemini takes a responseMimeType of application/json to switch into JSON output, and an optional responseSchema, expressed as an OpenAPI-style subset, to constrain the shape. Supplying the schema is what moves it from JSON mode to structured output.
- Anthropic does not expose a json_object switch. The idiomatic path to guaranteed structure is tool use: define a tool whose input schema is the object you want, force the model to call it, and read the arguments. The result is the same schema-conforming JSON by a different door.
- Self-hosted engines such as vLLM and SGLang implement schema-constrained decoding in the serving layer, usually exposed as a guided_json or structured-output parameter and backed by a grammar engine like XGrammar or Outlines. Because the constraint lives in the sampler, it works on any open model the engine serves, not only those trained for JSON.
- Aggregators and gateways generally pass response_format through to the underlying model. The catch is coverage: the field is only honored if the specific served model and backend support it, so the same request can be enforced on one route and ignored on another. Routing a structured-output workload therefore has to treat format support as an eligibility constraint, not an afterthought, which connects it to the broader problem of model routing.
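For the three hosted providers in the list above, the request fragments look roughly like this. Treat it as a sketch of the durable shapes rather than a reference: these are partial payloads with the model name and messages omitted, the decision and record_decision names and the schema itself are hypothetical, and field spellings (including the upper-case type names assumed for the Gemini schema) should be checked against each provider's current documentation.

```python
# Hypothetical contract used by all three fragments.
SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["approved", "rejected"]},
        "confidence": {"type": "number"},
    },
    "required": ["label", "confidence"],
    "additionalProperties": False,
}

# OpenAI-style: schema-enforced output through response_format.
openai_style = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "decision", "strict": True, "schema": SCHEMA},
    }
}

# Gemini-style: a MIME type switches on JSON output, a schema constrains it.
# The schema dialect is an OpenAPI-style subset, written out separately here.
gemini_style = {
    "generationConfig": {
        "responseMimeType": "application/json",
        "responseSchema": {
            "type": "OBJECT",
            "properties": {
                "label": {"type": "STRING", "enum": ["approved", "rejected"]},
                "confidence": {"type": "NUMBER"},
            },
            "required": ["label", "confidence"],
        },
    }
}

# Anthropic-style: the schema is a tool's input schema and the call is forced.
anthropic_style = {
    "tools": [
        {
            "name": "record_decision",
            "description": "Record the approval decision.",
            "input_schema": SCHEMA,
        }
    ],
    "tool_choice": {"type": "tool", "name": "record_decision"},
}
```

Keeping one canonical schema and translating it per provider at the edge is one way to keep portable code honest about the differences instead of hiding them.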
Why a model can still hand you broken JSON
Turning on the right mode removes the largest class of failures, not all of them. A few survive, and they are worth designing against.
JSON mode is not your schema. This is the most common disappointment. A json_object response is guaranteed to parse and guaranteed nothing else. If your code reads result["label"], JSON mode will not stop the model from returning result["classification"] instead. When the shape matters, you need the schema level, not the syntax level.
The prompt still has to mention JSON. Several providers will refuse to enable json_object mode unless the word JSON appears in the messages, a guard against the model emitting an unending stream of whitespace until it hits the token limit. A format flag set without a prompt that asks for JSON can error or silently do nothing.
Token limits truncate the object. Constrained decoding guarantees a valid prefix, but if the response hits max_tokens before the model writes the closing brace, you receive a syntactically incomplete object. Schemas with long string fields are the usual culprit. Budget output tokens for the largest plausible result, and treat truncation as a distinct error from a parse failure.
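One way to keep the two errors distinct is to check the stop reason before attempting to parse. The sketch below assumes an OpenAI-style finish_reason value of "length" for a token-limit cut-off. Other providers name this differently, and load_object and both exception classes are hypothetical.

```python
import json


class TruncatedOutput(Exception):
    """The response stopped at the token limit before the object closed."""


class MalformedOutput(Exception):
    """The response finished normally but is not valid JSON."""


def load_object(content: str, finish_reason: str) -> dict:
    # Assumption: "length" marks a max_tokens cut-off, as in OpenAI-style APIs.
    if finish_reason == "length":
        raise TruncatedOutput("raise max_tokens or shorten long string fields")
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        raise MalformedOutput(str(exc)) from exc


try:
    load_object('{"summary": "The quarterly rep', "length")
except TruncatedOutput as exc:
    print("truncated:", exc)
```

The distinction matters for the retry policy: a truncated object calls for a larger output budget, while a malformed one calls for a different route or a repair step.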
Reasoning models think out loud first. A reasoning model emits a thinking trace before its answer. Depending on the provider, that trace arrives in a separate channel or inline in the content, sometimes wrapped in tags. A parser that reads the entire response body and calls a JSON decoder on it will choke on the reasoning prefix even when the final answer is perfect. Read the answer channel, not the raw stream, and strip any inline thinking before parsing.
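For the inline case, a small pre-parse step is usually enough. This sketch assumes the trace is wrapped in think tags, which is one common convention but not a universal one. The tag name and the parse_answer helper are assumptions for illustration.

```python
import json
import re

# Assumed tag name; adjust to whatever the served model actually emits.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_answer(content: str) -> dict:
    """Drop any inline thinking trace, then parse what remains as JSON."""
    answer = THINK_BLOCK.sub("", content).strip()
    return json.loads(answer)


raw = '<think>The invoice total is 42, so approve.</think>\n{"label": "approved"}'
print(parse_answer(raw))  # {'label': 'approved'}
```

When the provider delivers reasoning in a separate field, none of this is needed. Reading the answer field directly is the more reliable option.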
Support is uneven. Not every model on every backend honors the format flag. Send a structured-output request to a route that ignores it and you are back to prompt-only behavior without realizing it. This is why a validation step belongs in the pipeline regardless: parse the result, check it against the schema in your own code, and repair or retry on a miss rather than trusting the flag end to end.
Putting it together
The shape of a robust structured-output path is consistent across workloads. Prefer the schema level over bare JSON mode whenever downstream code depends on specific fields, which is almost always. The setup cost of writing the schema is paid back the first time it stops a silent shape drift from corrupting your data.
Keep schemas tight enough to be useful but avoid over-constraining. A very deep or heavily conditional schema can degrade answer quality, because the model spends its capacity satisfying the grammar instead of reasoning about the content. Give the response generous max_tokens so a long object can finish, and pick a model and backend you have confirmed honor the format. Keep a validate-and-repair backstop as well: parse, schema-check, strip stray fences or thinking traces, and retry on failure, so a single unsupported route or a truncated object never reaches the rest of the system as malformed data.
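A validate-and-repair backstop can be quite small. The following is a minimal sketch under stated assumptions: call_model is a hypothetical zero-argument callable that returns the raw response text, REQUIRED is an invented contract checked by hand rather than with a full JSON Schema validator, and the only repair attempted is removing a stray Markdown fence.

```python
import json
import re
from typing import Callable

TICKS = "`" * 3  # a Markdown fence marker, built here to keep the listing readable
FENCE = re.compile(rf"^{TICKS}(?:json)?\s*|\s*{TICKS}$")

# Hypothetical contract: field name -> accepted Python type(s).
REQUIRED = {"label": str, "confidence": (int, float)}


def validate(raw: str) -> dict:
    """Strip a stray fence, parse, and check required keys and types."""
    cleaned = FENCE.sub("", raw.strip())
    obj = json.loads(cleaned)  # raises JSONDecodeError, a ValueError subclass
    if not isinstance(obj, dict):
        raise ValueError("top-level value is not an object")
    for key, expected in REQUIRED.items():
        if not isinstance(obj.get(key), expected):
            raise ValueError(f"field {key!r} is missing or has the wrong type")
    return obj


def get_structured(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the model until one response validates, or give up loudly."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate(call_model())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid object after {max_attempts} attempts") from last_error


# Usage with canned replies standing in for a real model call.
replies = iter([
    "Sure! Here is the result.",
    TICKS + 'json\n{"label": "approved", "confidence": 0.9}\n' + TICKS,
])
print(get_structured(lambda: next(replies)))
```

The first canned reply fails to parse and triggers a retry, and the second is accepted once its fence is removed. In a real pipeline the retry would typically feed the validation error back to the model or switch to a route known to enforce the schema.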
Structured output is one of the few LLM features where the right setup turns a probabilistic system into a near-deterministic one. The model still chooses the content, but it can no longer choose to hand you something your code cannot read. The Inferbase unified inference API exposes structured output and tool definitions consistently across the major providers, and the playground lets you try a schema against several model families before you wire it into a pipeline.