
What Is Tool Calling? How LLMs Invoke External Functions and Why Agents Depend On It

Inferbase Team · 13 min read

Tool calling is the mechanism that lets a large language model produce structured function-call requests as part of its generation, which the application then executes and feeds back to the model as a new message. It is how an LLM reaches beyond the prompt: to query a database, call an API, run a calculator, fetch a URL, search a vector store, or invoke any other piece of code the application chooses to expose. Without tool calling, a model can only answer from its training data and the immediate prompt context. With it, the model becomes a controller that issues structured requests, waits for results, and continues reasoning over the new information.

Every modern AI agent is built on top of this primitive. Frameworks like LangGraph, the OpenAI Agents SDK, CrewAI, and the various Anthropic agent patterns are orchestration layers around the same underlying capability: a model that can emit a tool call, an application that can execute it, and a conversation that can carry the result back into the model's context. The agent vocabulary often obscures the fact that the substrate is a single, well-defined feature of the API.

This post extends our guide to AI inference and our explanation of reasoning models down to the protocol layer that connects models to external systems.

What tool calling actually is

At the protocol level, tool calling is a structured-output convention with a return channel. The application sends the model a normal prompt along with a list of tool definitions, where each definition includes a name, a textual description, and a JSON Schema for the arguments. During generation, the model decides whether to produce a normal text response or to emit a structured tool call: a JSON object containing a tool name and arguments matching the schema. The application parses this object, executes the named function with the supplied arguments, and returns the function's result as a new message in the conversation, typed as a tool result rather than a user turn. The model then continues from that point, either issuing another tool call or producing a final answer.
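To make the convention concrete, here is roughly what a single tool definition looks like in the common OpenAI-style layout. The get_weather tool is hypothetical and the field names differ across providers; the essential parts are the name, the description, and the JSON Schema for the arguments.

```python
# A hypothetical tool definition in the OpenAI-style layout: a name, a
# description the model uses to decide when to call the tool, and a JSON
# Schema describing the arguments it must produce.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city. Use only for "
                       "current conditions, not forecasts.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```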

What makes the mechanism interesting is that the model is not running code or calling functions directly. The model only emits text. Tool calling is a convention by which a particular kind of structured text in the output stream is interpreted as a request for the application to execute something on the model's behalf. The application owns all the actual computation and side effects. The model owns only the decision of what to ask for and how to interpret the result.

The shape of a tool-calling exchange

A complete tool-calling cycle is multi-turn. The smallest useful exchange has three turns: the user prompt, the model's tool-call response, and the model's final answer after the tool result returns. In practice, exchanges are often longer because models chain multiple tool calls before producing a final response, especially when one call's output informs the arguments of the next.

The abstract structure is identical across providers, even though the field names differ. A typical sequence (a code sketch follows the list):

  1. The application sends the user prompt and the tool registry. Each tool has a name, a description that the model uses to decide whether to call it, and a JSON Schema describing its arguments.
  2. The model returns a response containing a tool-call object. The response includes the tool name, an argument dictionary that should match the schema, and a unique call ID used to associate this request with its eventual result.
  3. The application validates the arguments against the schema, executes the underlying function, and constructs a new tool-result message that references the call ID and contains the function's return value (typed as text or JSON, depending on the provider).
  4. The application sends the entire conversation back to the model, including the original prompt, the tool call, and the tool result.
  5. The model issues another tool call, or produces a final natural-language response.
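A minimal sketch of one complete cycle, assuming an OpenAI-compatible chat-completions client and the hypothetical get_weather tool defined earlier. Other providers follow the same flow with different field names.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client follows the same flow

client = OpenAI()

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Application-side function; the model never executes this itself.
    return {"city": city, "temp": 18, "unit": unit}

# get_weather_tool is the definition sketched in the previous section.
messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]

# Turn 1: the model sees the prompt and the tool registry and emits a tool call.
first = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=[get_weather_tool]
)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# The application executes the function and appends the result, keyed to the call ID.
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps(get_weather(**args)),
})

# Turn 2: the model reads the tool result and produces the final answer.
final = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=[get_weather_tool]
)
print(final.choices[0].message.content)
```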

The conversation is stateful only at the application level. The model itself is stateless and reconstructs the exchange from the full message history on every turn. This has important cost implications, which we cover below.

Figure: A tool-calling exchange across user, LLM, and tool. The model emits a structured tool_call as output, the application executes the underlying function, and the result returns as a new message in the conversation. The model continues from the new state to produce a final answer.

How models learn to do this

Tool calling is a trained behavior, not a prompt-engineering trick. Provider-side teams fine-tune models on supervised examples of correct tool-use sequences and reinforce on tasks where successful tool selection and argument formatting earn higher reward. The resulting policy learns to recognize when a tool is appropriate, choose among the available tools based on the descriptions, and produce arguments that conform to the supplied schema with high reliability.

The capability shipped at scale in OpenAI's June 2023 function calling release, which made structured tool calls a first-class output mode in the API. Anthropic followed with tool use the next year, Google added function calling in Gemini, and most major open-weight providers (Meta's Llama 3.1+, Mistral, Qwen, DeepSeek) have since shipped tool-calling variants either as separate model names or as prompt-time capability flags.

Open-weight models without native tool-calling training can still produce tool-call-formatted output through prompt engineering: instructions in the system prompt that ask the model to wrap function calls in a particular tag or JSON structure. The output still parses, but schema validity rates and tool-selection accuracy are materially lower than for natively-trained models, and the application code has to do more validation and recovery work as a result. This gap is the primary reason production deployments tend to default to natively-trained tool-calling models even when a smaller, cheaper open-weight option exists.
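A rough sketch of that prompt-engineered variant. The tag format, the system prompt, and the get_weather function are arbitrary illustrative choices, not a standard, and nothing guarantees the output will parse.

```python
import json
import re

# Ask a model without native tool-calling training to wrap function calls in a
# tag the application can parse out of the raw text output.
SYSTEM_PROMPT = """You can call the function get_weather(city, unit).
When you need it, respond with exactly:
<tool_call>{"name": "get_weather", "arguments": {"city": "...", "unit": "celsius"}}</tool_call>
Otherwise, answer the user directly."""

def parse_tool_call(raw_output: str) -> dict | None:
    match = re.search(r"<tool_call>(.*?)</tool_call>", raw_output, re.DOTALL)
    if not match:
        return None  # the model answered directly
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed call; the application has to recover
```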

Provider differences that matter

The category is not standardized at the protocol level. Every major provider exposes the same logical mechanism with different field names, different result formats, and slightly different reliability guarantees. The differences matter when an application needs to support multiple providers or move workloads between them.

OpenAI's API takes a tools array of function definitions and emits tool_calls in the assistant message. The platform exposes a strict mode that constrains decoding to guarantee the output matches the supplied JSON Schema, which substantially raises schema-validity rates at the cost of slightly slower generation. Multiple parallel tool calls are supported in a single response.
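As an illustration, strict mode is switched on per function definition, roughly like this; the schema has to mark every property as required and forbid extras so decoding can be constrained to it. The tool itself is still the hypothetical get_weather example.

```python
# Hypothetical get_weather tool with OpenAI-style strict mode enabled.
strict_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "strict": True,  # constrain decoding so arguments must match the schema
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],   # strict mode expects every property listed
            "additionalProperties": False,  # and no extras
        },
    },
}
```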

Anthropic's API uses tools and emits tool_use content blocks alongside any text content the assistant produces. Tool calls and text can be interleaved in the same response. Parallel tool calls are emitted as separate content blocks, and the application returns each result as a corresponding tool_result content block in the next user turn.
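For comparison, the same hypothetical tool in Anthropic's shape, along with the tool_result block the application sends back in the next user turn; the tool_use_id here is a placeholder copied from the model's tool_use block.

```python
# Anthropic-style definition: `input_schema` instead of `parameters`, and no
# wrapping "function" object.
anthropic_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# The result goes back as a tool_result content block inside a user turn,
# rather than as a dedicated "tool" role message.
tool_result_turn = {
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": "toolu_abc123",  # placeholder ID from the tool_use block
        "content": '{"city": "Berlin", "temp": 18}',
    }],
}
```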

Google's Gemini exposes functionCalls and functionResponses with a similar structure but distinct field names. Mistral, Cohere, and the major open-weight model providers all support tool calling, but the schemas, result formats, and reliability characteristics vary further.

This divergence is why most production deployments end up with an abstraction layer that normalizes the schema across providers and exposes a single tool-call interface to the application. Doing this badly produces a leaky abstraction that has to be patched every time a provider releases a new feature; doing it well requires treating provider-specific quirks as first-class state. Our model selection guide covers the broader portability question.
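A minimal sketch of what such a normalization layer might reduce provider responses to. The names and structure here are illustrative, not taken from any particular library, and real adapters carry far more provider-specific state.

```python
import json
from dataclasses import dataclass

@dataclass
class ToolCall:
    call_id: str
    name: str
    arguments: dict

def from_openai(message) -> list[ToolCall]:
    # OpenAI-style assistant message: tool_calls with JSON-string arguments.
    return [
        ToolCall(c.id, c.function.name, json.loads(c.function.arguments))
        for c in (message.tool_calls or [])
    ]

def from_anthropic(content_blocks: list[dict]) -> list[ToolCall]:
    # Anthropic-style content blocks: tool_use blocks with dict inputs.
    return [
        ToolCall(block["id"], block["name"], block["input"])
        for block in content_blocks
        if block.get("type") == "tool_use"
    ]
```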

Reliability characteristics

The reliability of tool calling is measured along three axes: schema validity (does the output parse), tool selection (did the model choose the right tool), and argument correctness (are the values reasonable for the task). All three are imperfect, and each fails in characteristic ways.

Schema validity is the easiest to harden. Strict mode and constrained decoding (the technique behind it) restrict the token-level decoder to outputs that match the supplied JSON Schema, which moves schema-validity rates from approximately 90 to 95 percent on free-form generation to effectively 100 percent on natively-supported providers. The trade-off is a small latency cost and a more constrained set of providers that support the feature.

Tool selection is the hardest to harden. The model decides which tool to invoke based on the textual description in the tool definition and on the conversational context, and selection errors look like: calling no tool when one was needed, calling the wrong tool when several have similar descriptions, or producing a hallucinated tool name not present in the registry. Mitigations include tighter and more discriminating tool descriptions, smaller registries (registry size correlates with selection error rate), and explicit instructions in the system prompt about when to call tools versus answer directly.
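As a small illustration of the description lever, compare a vague definition with a more discriminating one; both tools are hypothetical.

```python
# Vague: invites selection errors whenever several tools overlap.
vague = {"name": "search", "description": "Search for things."}

# Discriminating: states what the tool covers and which tool to prefer instead.
discriminating = {
    "name": "search_orders",
    "description": (
        "Look up a customer's past orders by order ID or email address. "
        "Use only for order history; use search_catalog for product questions."
    ),
}
```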

Argument correctness is the subtlest failure mode. The arguments parse and the right tool is called, but the values are wrong: an out-of-range integer, a malformed identifier, a date in the wrong format, or a fabricated value extracted from the prompt incorrectly. Mitigations are application-side: validate arguments against business rules before execution, return clear error messages when validation fails (so the model can recover), and log argument distributions to detect drift over time.
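A sketch of that application-side validation step, assuming the jsonschema library; the schema, the business rule, and the execute_forecast function are hypothetical.

```python
import json
from jsonschema import ValidationError, validate

FORECAST_ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "days": {"type": "integer", "minimum": 1, "maximum": 14},  # business rule as a schema constraint
    },
    "required": ["city"],
}

def run_forecast_tool(args: dict) -> str:
    # Validate before executing; on failure, return a clean, structured error
    # string the model can read and correct on its next turn.
    try:
        validate(instance=args, schema=FORECAST_ARGS_SCHEMA)
    except ValidationError as exc:
        return json.dumps({"error": f"invalid arguments: {exc.message}"})
    return json.dumps(execute_forecast(**args))  # hypothetical application function
```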

Cost and latency profile

Tool calling changes the cost and latency profile of an LLM workload because every cycle is a full request with the entire conversation history. A three-turn exchange (prompt, tool call, final answer) is two billed inferences against the model. A five-turn agent that issues three tool calls before answering is four billed inferences, and each one carries the cumulative input tokens of every prior message including all previous tool results.

Input tokens grow superlinearly with conversation depth because each turn adds the prior turn's output to the input of the next. A long-running agent that accumulates several thousand tokens of tool results per turn can spend the majority of its inference budget on input tokens that the model has effectively already seen. Prompt caching, where the provider supports it, recovers most of this cost on the cached prefix, but the marginal cost of each new turn still includes the new tool result and any output the model produces.
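A back-of-the-envelope sketch of that growth with made-up token counts; the point is only that each turn bills the whole accumulated history again.

```python
system_and_tools = 1_200   # system prompt + tool definitions, resent every turn
initial_prompt = 400
per_turn_output = 300      # tool call or text the model produces each turn
per_tool_result = 2_000    # tool result appended to the history each turn

total_billed_input = 0
history = system_and_tools + initial_prompt
for turn in range(5):
    total_billed_input += history                 # the full history is billed as input again
    history += per_turn_output + per_tool_result  # and grows before the next turn

print(total_billed_input)  # ~31,000 input tokens over five turns, roughly quadratic in depth
```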

Latency stacks linearly. Every cycle adds one model forward pass plus the execution time of the tool itself. An agent that makes five external API calls, each taking 200 milliseconds, returns its final answer no faster than 1 second of tool execution plus the cumulative model latency. Parallel tool calls (where the provider and the model support them) reduce model-side latency for tool calls that do not depend on each other, but the tool-side latency is bounded by the slowest call in the parallel batch. Our breakdown of the hidden costs of LLM APIs covers the related multi-turn cost dynamics.

The agent connection

An agent, in the modern usage of the term, is a control loop that runs tool-calling cycles until the model emits a stop signal or hits an iteration limit. Tool calling is the substrate; the agent is the loop and the orchestration logic around it. Frameworks like LangGraph, CrewAI, and the OpenAI Agents SDK provide the loop, the tool registry, the error handling, and observability layers; the underlying provider API provides the tool-call mechanism.

The earlier academic framing (the ReAct paper, the MRKL paper) treated reasoning and acting as distinct steps in a loop: think, act, observe, think again. Modern provider APIs collapse this into a single primitive that ships in the API itself, so the framework's job is no longer to elicit tool-call behavior from the model through prompting but to organize the loop, validate inputs, manage retries, and expose telemetry. The interesting engineering problems are now in the orchestration layer, not in the substrate.

This distinction matters because it clarifies what an agent framework is buying. The framework is not making the model better at tool calling; the model already does that natively. The framework is organizing the multi-turn conversation, providing safety rails, and instrumenting the system enough to debug it in production. A well-designed application can do the same work with direct provider-API calls, and many production deployments do exactly that for operational simplicity.

What you need beyond tool calling

The tool-call primitive on its own is insufficient for a production system. The pieces a real deployment needs include:

  • A tool registry with schema validation. Tool definitions should be the source of truth. Argument validation should reject malformed calls before execution and return errors the model can recover from.
  • Loop control with iteration and budget limits. Agent loops can run unbounded if the model fails to converge. Hard caps on iteration count, total tokens, and total tool calls prevent runaway costs (a minimal loop with these caps is sketched after this list).
  • Error recovery. Tool failures (timeouts, invalid arguments, downstream errors) should produce structured error messages that the model can parse and react to. Returning raw exceptions to the model usually produces worse recovery behavior than a clean error string.
  • Observability into the tool sequence. Production debugging requires access to the full multi-turn sequence: every tool call, every result, every model reasoning step. Logging only the final answer makes failures effectively undebuggable.
  • Prompt-injection defense at the tool boundary. Any tool that returns content from external systems (web search, document retrieval, email contents) is a prompt-injection vector. Tool results should be sanitized or sandboxed before the model reads them.
  • State management for multi-step workflows. Long-running agents that span minutes or hours need durable state, often outside the conversation, so that crashes and retries do not corrupt the loop.
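A minimal sketch of a bounded loop with clean error recovery, assuming an OpenAI-style API; the caps, model name, and registry structure are illustrative choices, not recommendations.

```python
import json
from openai import OpenAI

client = OpenAI()
MAX_TURNS = 8         # hard iteration cap
MAX_TOOL_CALLS = 12   # tool-call budget for the whole run

def run_agent(messages: list, tools: list, registry: dict) -> str:
    # registry maps tool names to application-side callables.
    tool_calls_used = 0
    for _ in range(MAX_TURNS):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final natural-language answer: stop the loop
        messages.append(msg)
        for call in msg.tool_calls:
            tool_calls_used += 1
            if tool_calls_used > MAX_TOOL_CALLS:
                return "Stopped: tool-call budget exceeded."
            try:
                fn = registry[call.function.name]  # hallucinated names raise KeyError
                result = fn(**json.loads(call.function.arguments))
                content = json.dumps(result)
            except Exception as exc:
                # Return a structured error the model can parse and react to,
                # rather than a raw traceback.
                content = json.dumps({"error": str(exc)})
            messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return "Stopped: iteration limit reached."
```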

These layers compose with the underlying tool-calling capability. None of them is part of the model API itself, and all of them appear in any production system that uses tool calling at scale.

Summary table

| Dimension | Standard text completion | Single tool call | Agent loop |
| --- | --- | --- | --- |
| Turns | 1 | 3 (prompt, tool, answer) | 5 to 50+ |
| Billed inferences | 1 | 2 | 4 to 50+ |
| Input token growth | Linear in prompt | Linear in prompt + tool result | Superlinear (history accumulates) |
| Latency | Single forward pass | Forward pass + tool latency | Cumulative across turns |
| Failure modes | Hallucination, format drift | Above, plus tool-call schema and selection errors | Above, plus loop divergence and state corruption |
| Observability needs | Output logs | Output logs + tool calls | Full call sequence + state |
| Application complexity | Minimal | Moderate (tool registry, schema validation) | High (orchestration, recovery, observability) |

What this means in practice

Tool calling is now the integration substrate for AI products that need to interact with anything outside the prompt. The mechanism itself is well understood and broadly supported; the engineering work has moved up the stack to reliability, observability, and cost management. Teams building AI features should expect to spend most of their effort on the orchestration layer, not on getting the model to emit tool calls.

For multi-provider deployments, the most common operational issue is the schema divergence between providers. A single application that serves traffic to two or three model families ends up writing per-provider adapters, normalizing tool schemas, and reconciling field-name differences in tool results. Gateway products that expose a unified tool-calling interface across providers absorb this complexity at the platform layer.

To prototype tool-calling workloads against multiple model families, the Inferbase playground supports tool definitions and structured outputs, and the unified inference API exposes tool calling consistently across the major providers.

