JSON mode is a provider setting, usually response_format with type json_object, that constrains a model to return syntactically valid JSON instead of free-form prose. It guarantees the output will parse, but it does not guarantee the shape: fields can be missing, renamed, or carry the wrong type. Most providers also require the word JSON to appear somewhere in the prompt before they will honor the mode.

What is the difference between JSON mode and structured outputs?

JSON mode guarantees only that the output is valid JSON of some shape. Structured outputs, supplied as a JSON schema, guarantee that the output is valid JSON and conforms to your schema: the exact keys, types, enums, and required fields you declared. Structured outputs are enforced by constrained decoding, which restricts the model to tokens that keep the output schema-valid at every step, so the result cannot drift from the contract. Use JSON mode when any JSON object is acceptable, and structured outputs when downstream code depends on a fixed shape.

Why is my LLM not returning valid JSON?

If you only ask for JSON in the prompt without a format constraint, a weaker model often returns prose around the JSON, wraps it in a Markdown code fence, leaves string values unquoted, or trails off when it hits the token limit and never closes the final brace. Reasoning models add another failure: they emit a thinking trace before the answer, so a naive parser reading the whole response chokes on the prose prefix. The fix is to enable JSON mode or structured outputs on a model that supports it, give the response enough max_tokens to finish, and keep a validate-and-repair step as a backstop.

Which providers support structured outputs?

The major hosted providers all offer some form of it, with different field names. OpenAI exposes both a json_object mode and a strict json_schema mode. Google Gemini takes a responseMimeType of application/json plus a responseSchema. Anthropic does not expose a json_object switch and instead routes structured output through tool use, where you define a tool whose input schema is the shape you want and force the call. Self-hosted engines such as vLLM and SGLang implement schema-constrained decoding directly through grammar backends. Aggregators pass the response_format field through to whichever served model supports it.

Structured Outputs and JSON Mode: Getting Reliable JSON From an LLM

Most LLM features are built for a human reader. The model writes a paragraph, a person reads it, and a small imperfection does not matter. Structured output is the opposite case. Here the model writes something a program will parse, store, or pass to the next step, and a single malformed character breaks the whole pipeline. Extraction jobs, classification labels, tool arguments, form filling, and agent state all share this requirement. They need a JSON object of a known shape, every time, not prose.

The naive approach is to ask for it. You append "respond only with JSON" to the prompt and hope. On a strong model this works often enough to feel solved, and then it fails in production, usually in one of a few predictable ways. The model wraps the JSON in a Markdown fence, adds a sentence of explanation before it, leaves a string value unquoted, or runs into the token limit mid-object and never writes the closing brace. The result looks correct to a human and fails the moment a program tries to load it. This guide covers the mechanisms that turn "usually JSON" into "always valid JSON of the right shape," how the major providers expose them, and the failure modes that survive even when you do everything right.

Three levels of guarantee

There is a ladder here, and the rungs are easy to confuse because the vocabulary overlaps. Each level removes a specific class of failure and costs a little more setup than the one below it.

Prompt only is the request with an instruction and no machine-level constraint. The model is free to emit any tokens it likes, so the guarantee is zero. It can still be the right tool for a quick script against a frontier model, but it has no place in a path that must not break.

JSON mode is the first real constraint. You set a request flag, commonly response_format with a type of json_object, and the provider guarantees the output is syntactically valid JSON. No stray prose, no code fence, no unquoted values. What it does not guarantee is the shape. The model can return {"answer": "yes"} when you expected {"label": "approved", "confidence": 0.9}, and both are valid JSON. JSON mode removes parse failures but does nothing about contract failures.

Structured outputs close that gap. You supply a JSON schema, the keys, types, enums, and required fields you actually depend on, and the provider guarantees the output is valid JSON that conforms to the schema. This is the level that lets downstream code treat the response as a typed object rather than a hopeful guess. It is also the level where the mechanism becomes interesting, because the guarantee is not a check run after the fact. It is enforced while the model is still choosing tokens.

How structured outputs actually work

A language model generates one token at a time by sampling from a probability distribution over its entire vocabulary. Left alone, nothing stops it from sampling a token that makes the output invalid, a letter where the schema requires a digit, a comma where it requires a closing brace. Constrained decoding removes those tokens from contention before the sample is drawn.

The provider compiles your schema into a state machine, effectively a grammar that describes every valid continuation of the output so far. At each step, it intersects the model's distribution with the set of tokens the grammar allows from the current state, masking out everything else by driving its probability to zero. The model then samples only from the legal tokens. Because an illegal token is never available, the output cannot become invalid, and by the time generation ends the result satisfies the schema by construction.

This is why structured outputs are categorically stronger than asking nicely. A prompt-only request relies on the model having learned to produce well-formed JSON, while constrained decoding makes well-formed JSON the only thing the model is physically able to produce. The same machinery powers tool and function calling, where the schema is the tool's parameter definition, and it is the reason a forced tool call returns clean arguments even from a model that fumbles raw JSON. The relationship runs the other way too: tool calling is, mechanically, structured output with a function signature standing in for the schema.

How the providers differ

The capability is widely available, but the field names and the exact guarantees vary, so portable code has to account for the differences. The shapes below are the durable patterns; check each provider's current reference for precise field names.

OpenAI exposes two levels directly. A response_format of json_object gives plain JSON mode, and a response_format of json_schema with a strict flag gives schema-enforced structured outputs through constrained decoding. The same strict enforcement is available on function definitions.
Google Gemini takes a responseMimeType of application/json to switch into JSON output, and an optional responseSchema, expressed as an OpenAPI-style subset, to constrain the shape. Supplying the schema is what moves it from JSON mode to structured output.
Anthropic does not expose a json_object switch. The idiomatic path to guaranteed structure is tool use: define a tool whose input schema is the object you want, force the model to call it, and read the arguments. The result is the same schema-conforming JSON by a different door.
Self-hosted engines such as vLLM and SGLang implement schema-constrained decoding in the serving layer, usually exposed as a guided_json or structured-output parameter and backed by a grammar engine like XGrammar or Outlines. Because the constraint lives in the sampler, it works on any open model the engine serves, not only those trained for JSON.
Aggregators and gateways generally pass response_format through to the underlying model. The catch is coverage: the field is only honored if the specific served model and backend support it, so the same request can be enforced on one route and ignored on another. Routing a structured-output workload therefore has to treat format support as an eligibility constraint, not an afterthought, which connects it to the broader problem of model routing.

Why a model can still hand you broken JSON

Turning on the right mode removes the largest class of failures, not all of them. A few survive, and they are worth designing against.

JSON mode is not your schema. This is the most common disappointment. A json_object response is guaranteed to parse and guaranteed nothing else. If your code reads result["label"], JSON mode will not stop the model from returning result["classification"] instead. When the shape matters, you need the schema level, not the syntax level.

The prompt still has to mention JSON. Several providers will refuse to enable json_object mode unless the word JSON appears in the messages, a guard against the model emitting {} forever. A format flag set without a prompt that asks for JSON can error or silently do nothing.

Token limits truncate the object. Constrained decoding guarantees a valid prefix, but if the response hits max_tokens before the model writes the closing brace, you receive a syntactically incomplete object. Schemas with long string fields are the usual culprit. Budget output tokens for the largest plausible result, and treat truncation as a distinct error from a parse failure.

Reasoning models think out loud first. A reasoning model emits a thinking trace before its answer. Depending on the provider, that trace arrives in a separate channel or inline in the content, sometimes wrapped in tags. A parser that reads the entire response body and calls a JSON decoder on it will choke on the reasoning prefix even when the final answer is perfect. Read the answer channel, not the raw stream, and strip any inline thinking before parsing.

Support is uneven. Not every model on every backend honors the format flag. Send a structured-output request to a route that ignores it and you are back to prompt-only behavior without realizing it. This is why a validate step belongs in the pipeline regardless: parse the result, check it against the schema in your own code, and repair or retry on a miss rather than trusting the flag end to end.

Putting it together

The shape of a robust structured-output path is consistent across workloads. Prefer the schema level over bare JSON mode whenever downstream code depends on specific fields, which is almost always. The setup cost of writing the schema is paid back the first time it stops a silent shape drift from corrupting your data.

Keep schemas tight enough to be useful but avoid over-constraining. A very deep or heavily conditional schema can degrade answer quality, because the model spends its capacity satisfying the grammar instead of reasoning about the content. Give the response generous max_tokens so a long object can finish, and pick a model and backend you have confirmed honor the format. Keep a validate-and-repair backstop as well: parse, schema-check, strip stray fences or thinking traces, and retry on failure, so a single unsupported route or a truncated object never reaches the rest of the system as malformed data.

Structured output is one of the few LLM features where the right setup turns a probabilistic system into a near-deterministic one. The model still chooses the content, but it can no longer choose to hand you something your code cannot read. The Inferbase unified inference API exposes structured output and tool definitions consistently across the major providers, and the playground lets you try a schema against several model families before you wire it into a pipeline.

Structured Outputs and JSON Mode: Getting Reliable JSON From an LLM

Three levels of guarantee

How structured outputs actually work

How the providers differ

Why a model can still hand you broken JSON

Putting it together

Frequently asked questions

Have thoughts on this article?

Related Articles

Prompt Caching Explained: What Cuts Your Bill and What Breaks It

Sizing Llama 4 Scout for Production Inference

GPU Sizing Guide for LLM Inference in Production

Stay up to date

Start building with the right model.

Structured Outputs and JSON Mode: Getting Reliable JSON From an LLM

Three levels of guarantee

How structured outputs actually work

How the providers differ

Why a model can still hand you broken JSON

Putting it together

Frequently asked questions

What is JSON mode?

What is the difference between JSON mode and structured outputs?

Why is my LLM not returning valid JSON?

Which providers support structured outputs?

Have thoughts on this article?

Related Articles

Prompt Caching Explained: What Cuts Your Bill and What Breaks It

Sizing Llama 4 Scout for Production Inference

GPU Sizing Guide for LLM Inference in Production

Stay up to date

Start building with the right model.