AI inference is the process by which a trained machine learning model produces an output in response to new input. When a chatbot answers a question, a search system returns a ranked result, or an image model generates a picture, inference is the step that turns a saved set of model weights into the response the user sees. Training built the model once; inference runs it every time someone uses it.
The term has moved from infrastructure-paper jargon into general engineering vocabulary for a specific reason. Models became useful enough that serving them at scale became a distinct problem class, and the cost, latency, and reliability characteristics of that serving layer now dominate the economics of any AI-powered product. The GPU cluster that trained a frontier model runs once. The inference layer that answers every user request runs forever.
## What inference actually means
The cleanest way to think about inference is as the runtime execution of a frozen model. During training, the model's weights change as it learns from examples. Once training finishes, those weights are saved. Inference is the phase that takes those fixed weights, combines them with a new input, and computes an output. Nothing about the model itself changes during inference. It is a read-only forward computation.
For large language models specifically, a single inference request decomposes into four stages that each impose their own cost and latency profile:
- Tokenization. The input text is split into tokens (units roughly the size of a syllable or short word) that the model's vocabulary can represent numerically. A 500-word document becomes somewhere between 600 and 800 tokens depending on the tokenizer.
- Prefill. The tokens are passed through the model's layers in a single forward pass that produces a representation of the input and the first output token. This stage is compute-bound and highly parallelizable.
- Autoregressive decode. The model generates each subsequent output token one at a time, with each new token conditioned on all previous tokens. This stage is memory-bandwidth bound rather than compute-bound, which is why it dominates the latency of long responses.
- Detokenization. The output tokens are converted back into the visible text the user receives.
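The four stages can be sketched as a toy pipeline. The whitespace tokenizer and arithmetic "model" below are stand-ins for a learned subword vocabulary and a neural forward pass; only the control flow mirrors real systems.

```python
def tokenize(text, vocab):
    # Stage 1: map text to integer token IDs the model can consume.
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def prefill(prompt_ids):
    # Stage 2: one parallel pass over the whole prompt.
    # Stub: "state" is just the token list; a real engine builds a KV cache here.
    return list(prompt_ids)

def decode(state, steps):
    # Stage 3: sequential loop; each new token depends on everything before it.
    out = []
    for _ in range(steps):
        next_id = (sum(state) + len(state)) % 50  # stand-in for the model's prediction
        state.append(next_id)                     # extend the context (KV cache append)
        out.append(next_id)
    return out

def detokenize(ids, vocab):
    # Stage 4: map token IDs back to text (unknown IDs shown as placeholders).
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv.get(i, f"<{i}>") for i in ids)

vocab = {}
prompt_ids = tokenize("what is ai inference", vocab)
state = prefill(prompt_ids)
reply_ids = decode(state, steps=4)
print(detokenize(reply_ids, vocab))
```

The shape of the `decode` loop is the important part: prefill touches every prompt token at once, while decode can only advance one token per iteration.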
The asymmetry between prefill (fast, parallel) and decode (slow, sequential) is one of the reasons LLM inference has developed its own infrastructure specialization separate from traditional machine-learning model serving.
## Inference vs training: the short version
Training and inference differ in almost every operational dimension. Training consumes a large dataset once and produces a set of weights; inference consumes a single input and produces a single output, repeatedly.
| Dimension | Training | Inference |
|---|---|---|
| Frequency | Once per model version | Once per user request |
| Duration | Days to months | Milliseconds to seconds |
| Hardware | Tightly coupled GPU clusters with high inter-node bandwidth | Individual GPUs or small groups |
| Cost shape | Capital-like, paid upfront | Operating-like, proportional to usage |
| Who absorbs the cost | Model creator | Application team |
| Optimization priority | Model quality | Cost, latency, reliability |
The implication for most teams building on top of foundation models is that training economics are a vendor's problem. Inference economics are the team's problem. Every architectural decision downstream of model selection is an inference decision: which provider to use, where to run, how to batch, whether to cache, when to fall back. A dedicated post in this series will cover the technical distinctions in more depth.
## Types of inference workloads
The abstract definition covers every case where a trained model produces an output, but the workload profile varies enough across types that the main categories are worth naming separately.
Text generation (chat). The dominant inference workload today. A prompt enters, tokens are generated one at a time, and the response is streamed to the user. Typical latency is 200-800ms to first token and 40-80 tokens per second during decode. Cost is roughly proportional to input tokens plus output tokens weighted at 3-5x the input rate.
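The latency and cost figures above can be turned into a back-of-envelope estimator. The prices and rates below are illustrative placeholders taken from the ranges in this section, not any provider's actual pricing.

```python
def chat_estimate(input_tokens, output_tokens,
                  ttft_s=0.5,          # time to first token (within the 200-800ms range)
                  decode_tps=60,       # decode speed (within the 40-80 tokens/s range)
                  in_price=0.50,       # $ per million input tokens (assumed)
                  out_multiplier=4):   # output priced at 3-5x the input rate
    """Rough per-request latency and cost for a chat workload."""
    latency_s = ttft_s + output_tokens / decode_tps
    cost = (input_tokens * in_price
            + output_tokens * in_price * out_multiplier) / 1_000_000
    return latency_s, cost

latency, cost = chat_estimate(input_tokens=1_000, output_tokens=500)
print(f"~{latency:.1f}s end-to-end, ~${cost:.6f} per request")
```

Note that decode time, not prefill, dominates the latency: a 500-token answer takes over eight seconds at 60 tokens per second regardless of how fast the first token arrives.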
Reasoning and chain-of-thought. A variant of text generation in which the model produces a large block of internal thinking before its visible response. The cost profile is materially different. Reasoning workloads can generate 3-10x more output tokens than their visible answer suggests, and latency shifts from hundreds of milliseconds to seconds or minutes. Models like OpenAI's o-series, Anthropic's extended thinking modes, and DeepSeek-R1 operate in this category.
Embeddings. A text input produces a fixed-length vector rather than generated text. Used for search, retrieval-augmented generation (RAG), clustering, and deduplication. Latency is typically sub-100ms per request, and per-call cost is an order of magnitude lower than generative workloads. Embedding inference is what makes most modern search and memory systems possible.
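What embedding inference enables is easiest to see in code: rank documents by cosine similarity to a query vector. The four-dimensional vectors here are made up for illustration; real embedding models emit hundreds to thousands of dimensions per input.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by the vectors' magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend these came back from an embedding API call.
query = [0.9, 0.1, 0.0, 0.2]
docs = {
    "gpu pricing":  [0.8, 0.2, 0.1, 0.3],
    "cake recipes": [0.0, 0.9, 0.4, 0.0],
}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # most semantically similar document first
```

A RAG pipeline is this loop at scale: embed the query, rank a corpus, and feed the top results into a generative request.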
Vision and multimodal. An image, video frame, or audio clip combined with text produces a text or structured output. Latency scales with input resolution, and cost depends on how many tokens the input encodes to. The same serving infrastructure handles these workloads, but tokenizing the input is substantially more expensive than for plain text.
Batch inference. Any of the above workloads submitted asynchronously with tolerance for up-to-24-hour completion. Most major providers offer a batch tier at roughly 50% off list price. For pipelines that do not require real-time responses (overnight enrichment, offline evaluation, scheduled data preparation), batch inference is the largest published discount available, and the one most teams underuse.
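The batch-tier arithmetic is simple but worth making concrete. The list price and volumes below are placeholders; the 50% discount is the rough figure cited above.

```python
def monthly_cost(requests, tokens_per_request, price_per_m, discount=0.0):
    """Monthly spend for a workload at a given $/M-token rate and batch discount."""
    return requests * tokens_per_request * price_per_m * (1 - discount) / 1_000_000

# Hypothetical enrichment pipeline: 2M requests/month, ~1,500 tokens each.
realtime = monthly_cost(2_000_000, 1_500, price_per_m=2.00)
batch    = monthly_cost(2_000_000, 1_500, price_per_m=2.00, discount=0.50)
print(f"realtime ${realtime:,.0f}/mo vs batch ${batch:,.0f}/mo")
```

For any pipeline that already runs on a schedule, the only cost of claiming this discount is tolerating the completion window.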
## Where inference actually runs
Running inference at production quality depends on three infrastructure layers that are invisible from the API surface but dominate the cost structure.
The GPU tier. Modern LLMs require GPUs with high memory capacity and bandwidth. NVIDIA H100 and H200 cards (80-141GB of HBM memory) are the current workhorses for large models; B200 is beginning to displace them for new deployments. Smaller models can run on A100 or L40S cards, and very small models on consumer cards. The choice of card affects both throughput and the minimum model size that can be served without sharding across multiple cards.
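A rough sizing rule makes the sharding threshold concrete: weights alone need about 2 bytes per parameter at FP16/BF16 (half that at 8-bit quantization), before KV cache and activations. The overhead factor below is an assumed allowance for those runtime buffers, not a measured figure.

```python
import math

def min_gpus(params_b, bytes_per_param=2, gpu_mem_gb=80, overhead=1.2):
    """Minimum 80GB-class GPUs to hold a model's weights plus assumed runtime overhead.

    params_b: parameter count in billions. overhead: crude multiplier for
    KV cache and activations (illustrative, not a benchmark).
    """
    needed_gb = params_b * bytes_per_param * overhead
    return math.ceil(needed_gb / gpu_mem_gb)

for size in (8, 70, 405):
    print(f"{size}B params -> {min_gpus(size)}x 80GB GPUs (FP16, rough)")
```

The discontinuity matters: an 8B model fits on one card, a 70B model forces multi-GPU sharding, and a 405B model forces multi-node serving, each step adding interconnect cost and operational complexity.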
The serving engine. Raw GPU access is not sufficient. Running a model efficiently requires a serving engine that manages the KV cache, batches concurrent requests, and optimizes memory layout. The engines most commonly seen in production are vLLM (open-source, general purpose), TensorRT-LLM (NVIDIA-specific, high performance), and SGLang (open-source, optimized for structured generation). The serving engine alone can change effective throughput by 3-5x on the same hardware.
Regional deployment. Inference latency is dominated by compute time, but network round-trip adds 30-200ms on top. For user-facing workloads, serving from a region close to the user (or from a provider with multi-region routing) meaningfully affects perceived responsiveness.
Operating this stack in-house is viable for a small number of teams with serious scale and dedicated infrastructure engineering. For everyone else, the infrastructure is purchased, and the question becomes which purchasing model fits the workload.
## How teams access inference in practice
Three patterns cover almost all production inference access today.
Self-hosting on rented GPUs. The team rents GPU capacity (from AWS, Runpod, Lambda, or similar), deploys an open-weight model on a serving engine, and operates the full stack. This gives maximum control and can be the cheapest path at very high volume, but it requires infrastructure expertise most application teams do not want to build: KV cache tuning, autoscaling, failover, observability.
Direct provider APIs. The team calls a single provider's API: OpenAI for GPT, Anthropic for Claude, Together or Fireworks for open-weight models. The provider handles the infrastructure. This is the fastest path to a working product, and it is how most teams start. The tradeoff is concentration risk and lock-in on a single pricing structure. Every request's cost, latency, and reliability become whatever that provider's service delivers that day.
Unified inference platforms. The team calls a single API that routes to multiple underlying providers with fallback, observability, and cost controls built in. This preserves the speed-to-ship advantage of direct provider APIs while removing single-provider concentration risk and allowing workload-aware routing (cheap model for simple queries, capable model for complex ones). Inferbase sits in this category. We operate an OpenAI-compatible API that spans 200+ models across multiple underlying providers, with routing, fallback, and cost transparency as first-class concerns.
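The routing idea behind a unified platform reduces to a small core: try providers in priority order and fall through on failure. The provider callables below are stubs standing in for real API clients; the interface is hypothetical.

```python
def route(prompt, providers):
    """Try (name, callable) pairs in order; return the first success."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:        # timeout, rate limit, 5xx, ...
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(prompt):
    # Stand-in for a provider that is currently down.
    raise TimeoutError("upstream timeout")

def healthy(prompt):
    # Stand-in for a working provider.
    return f"echo: {prompt}"

name, reply = route("hello", [("primary", flaky), ("fallback", healthy)])
print(name, reply)  # fallback echo: hello
```

Production routers layer workload-aware selection, latency tracking, and cost accounting on top of this loop, but failover between independent providers is the mechanism that removes single-vendor concentration risk.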
| Access pattern | Speed to ship | Operational burden | Economical at | Concentration risk |
|---|---|---|---|---|
| Self-hosted on rented GPUs | Slow | High (full infra stack) | Very high volume | None |
| Direct provider API | Fast | Low | Low to medium volume | High (single vendor) |
| Unified inference platform | Fast | Low | Any volume, mixed workloads | Low (multi-provider) |
The choice among these three is driven by scale, engineering bandwidth, and how much variance in cost and reliability across providers the team is willing to absorb directly.
## Why inference is the ongoing cost center
Inference dominates the economics of AI-powered products because it is the only cost that scales with usage. A model's training cost is fixed. The provider (or the team, if self-training) pays it once. Every inference request, by contrast, consumes new compute. Ten users become ten inferences; one million users become one million inferences. The unit economics of an AI product are, in practice, unit economics of inference.
This structure has two consequences that tend to catch teams off guard. First, the cost of an AI feature grows directly with its success: a feature with poor adoption has a small inference bill, and a feature with strong adoption has a proportionally larger one. Second, every architectural choice that changes inference cost compounds with usage, whether it is model selection, prompt design, caching strategy, provider choice, or retry policy. A 20% reduction in per-request cost at the prototype stage becomes a 20% reduction in marginal cost forever.
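The compounding claim is plain arithmetic. Marginal cost scales linearly with requests, so a 20% per-request reduction is 20% at every scale. The per-request cost and volumes below are illustrative.

```python
def monthly_bill(requests, cost_per_request):
    """Marginal inference spend: linear in request volume."""
    return requests * cost_per_request

base = 0.002             # $ per request (illustrative)
optimized = base * 0.8   # 20% cheaper via model choice, caching, prompt design

for label, reqs in [("10k users", 1_000_000), ("1M users", 100_000_000)]:
    saved = monthly_bill(reqs, base) - monthly_bill(reqs, optimized)
    print(f"{label}: ${saved:,.0f}/mo saved")
```

The same optimization that saves a few hundred dollars at prototype scale saves tens of thousands per month once the product finds traction, which is why per-request cost deserves attention before launch, not after.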
Our post on the hidden costs of LLM APIs covers the specific factors that cause real inference bills to diverge from published list rates. The short version is that the headline $/M token figure describes a best-case single request, and production workloads are rarely in that case.
## What this unlocks
Inference is the layer where every abstract decision about AI (which model, which provider, which architecture) collapses into a concrete per-request cost, latency, and reliability profile. Teams that treat inference as an operational concern from the start make architectural choices that scale; teams that treat it as a downstream implementation detail tend to rebuild their infrastructure once their product finds traction.
The practical step beyond understanding the term is running real workloads against real models. Our playground provides side-by-side model comparison without a deployment commitment, and the unified inference API handles the routing, fallback, and observability that production inference workloads require. For teams still choosing a model, our guide to choosing the right AI model walks through the workload-to-model mapping we use internally.