AI inference is the process by which a trained machine learning model produces an output in response to new input. When a chatbot answers a question, a search system returns a ranked result, or an image model generates a picture, inference is the step that turns a saved set of model weights into the response the user sees. Training built the model once; inference runs it every time someone uses it.
The term has moved from infrastructure-paper jargon into general engineering vocabulary for a specific reason. Models became useful enough that serving them at scale became a distinct problem class, and the cost, latency, and reliability characteristics of that serving layer now dominate the economics of any AI-powered product. The GPU cluster that trained a frontier model runs once. The inference layer that answers every user request runs forever.
## What inference actually means
The cleanest way to think about inference is as the runtime execution of a frozen model. During training, the model's weights change as it learns from examples. Once training finishes, those weights are saved. Inference is the phase that takes those fixed weights, combines them with a new input, and computes an output. Nothing about the model itself changes during inference. It is a read-only forward computation.
For large language models specifically, a single inference request decomposes into four stages that each impose their own cost and latency profile:
- Tokenization. The input text is split into tokens (units roughly the size of a syllable or short word) that the model's vocabulary can represent numerically. A 500-word document becomes somewhere between 600 and 800 tokens depending on the tokenizer.
- Prefill. The tokens are passed through the model's layers in a single forward pass that produces a representation of the input and the first output token. This stage is compute-bound and highly parallelizable.
- Autoregressive decode. The model generates each subsequent output token one at a time, with each new token conditioned on all previous tokens. This stage is memory-bandwidth bound rather than compute-bound, which is why it dominates the latency of long responses.
- Detokenization. The output tokens are converted back into the visible text the user receives.
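The four stages can be sketched as a toy pipeline. The whitespace tokenizer and arithmetic "model" below are stand-ins for a learned subword vocabulary and a neural forward pass; only the control flow mirrors real systems.

```python
def tokenize(text, vocab):
    # Stage 1: map text to integer token IDs the model can consume.
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def prefill(prompt_ids):
    # Stage 2: one parallel pass over the whole prompt.
    # Stub: "state" is just the token list; a real engine builds a KV cache here.
    return list(prompt_ids)

def decode(state, steps):
    # Stage 3: sequential loop; each new token depends on everything before it.
    out = []
    for _ in range(steps):
        next_id = (sum(state) + len(state)) % 50  # stand-in for the model's prediction
        state.append(next_id)                     # extend the context (KV cache append)
        out.append(next_id)
    return out

def detokenize(ids, vocab):
    # Stage 4: map token IDs back to text (unknown IDs shown as placeholders).
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv.get(i, f"<{i}>") for i in ids)

vocab = {}
prompt_ids = tokenize("what is ai inference", vocab)
state = prefill(prompt_ids)
reply_ids = decode(state, steps=4)
print(detokenize(reply_ids, vocab))
```

The shape of the `decode` loop is the important part: prefill touches every prompt token at once, while decode can only advance one token per iteration.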
The asymmetry between prefill (fast, parallel) and decode (slow, sequential) is one of the reasons LLM inference has developed its own infrastructure specialization separate from traditional machine-learning model serving.
## Inference vs training: the short version
Training and inference differ in almost every operational dimension. Training consumes a large dataset once and produces a set of weights; inference consumes a single input and produces a single output, repeatedly.
| Dimension | Training | Inference |
|---|---|---|
| Frequency | Once per model version | Once per user request |
| Duration | Days to months | Milliseconds to seconds |
| Hardware | Tightly coupled GPU clusters with high inter-node bandwidth | Individual GPUs or small groups |
| Cost shape | Capital-like, paid upfront | Operating-like, proportional to usage |
| Who absorbs the cost | Model creator | Application team |
| Optimization priority | Model quality | Cost, latency, reliability |
The implication for most teams building on top of foundation models is that training economics are a vendor's problem. Inference economics are the team's problem. Every architectural decision downstream of model selection is an inference decision: which provider to use, where to run, how to batch, whether to cache, when to fall back. A dedicated post in this series will cover the technical distinctions in more depth.
## Types of inference workloads
The abstract definition covers every case where a trained model produces an output, but the workload profile varies enough across types that the main categories are worth naming separately.
Text generation (chat). The dominant inference workload today. A prompt enters, tokens are generated one at a time, and the response is streamed to the user. Typical latency is 200-800ms to first token and 40-80 tokens per second during decode. Cost is roughly proportional to input tokens plus output tokens weighted at 3-5x the input rate.
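The latency and cost figures above can be turned into a back-of-envelope estimator. The prices and rates below are illustrative placeholders taken from the ranges in this section, not any provider's actual pricing.

```python
def chat_estimate(input_tokens, output_tokens,
                  ttft_s=0.5,          # time to first token (within the 200-800ms range)
                  decode_tps=60,       # decode speed (within the 40-80 tokens/s range)
                  in_price=0.50,       # $ per million input tokens (assumed)
                  out_multiplier=4):   # output priced at 3-5x the input rate
    """Rough per-request latency and cost for a chat workload."""
    latency_s = ttft_s + output_tokens / decode_tps
    cost = (input_tokens * in_price
            + output_tokens * in_price * out_multiplier) / 1_000_000
    return latency_s, cost

latency, cost = chat_estimate(input_tokens=1_000, output_tokens=500)
print(f"~{latency:.1f}s end-to-end, ~${cost:.6f} per request")
```

Note that decode time, not prefill, dominates the latency: a 500-token answer takes over eight seconds at 60 tokens per second regardless of how fast the first token arrives.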
Reasoning and chain-of-thought. A variant of text generation in which the model produces a large block of internal thinking before its visible response. The cost profile is materially different. Reasoning workloads can generate 3-10x more output tokens than their visible answer suggests, and latency shifts from hundreds of milliseconds to seconds or minutes. Models like OpenAI's o-series, Anthropic's extended thinking modes, and DeepSeek-R1 operate in this category.
Embeddings. A text input produces a fixed-length vector rather than generated text. Used for search, retrieval-augmented generation (RAG), clustering, and deduplication. Latency is typically sub-100ms per request, and per-call cost is an order of magnitude lower than generative workloads. Embedding inference is what makes most modern search and memory systems possible.
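What embedding inference enables is easiest to see in code: rank documents by cosine similarity to a query vector. The four-dimensional vectors here are made up for illustration; real embedding models emit hundreds to thousands of dimensions per input.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by the vectors' magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend these came back from an embedding API call.
query = [0.9, 0.1, 0.0, 0.2]
docs = {
    "gpu pricing":  [0.8, 0.2, 0.1, 0.3],
    "cake recipes": [0.0, 0.9, 0.4, 0.0],
}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # most semantically similar document first
```

A RAG pipeline is this loop at scale: embed the query, rank a corpus, and feed the top results into a generative request.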
Vision and multimodal. An image, video frame, or audio clip combined with text produces a text or structured output. Latency scales with input resolution, and cost depends on how many tokens the input encodes to. The same serving infrastructure handles these workloads, but tokenizing the input is substantially more expensive than for plain text.
Batch inference. Any of the above workloads submitted asynchronously with tolerance for up-to-24-hour completion. Most major providers offer a batch tier at roughly 50% off list price. For pipelines that do not require real-time responses (overnight enrichment, offline evaluation, scheduled data preparation), batch inference is the largest published discount available, and the one most teams underuse.
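The batch-tier arithmetic is simple but worth making concrete. The list price and volumes below are placeholders; the 50% discount is the rough figure cited above.

```python
def monthly_cost(requests, tokens_per_request, price_per_m, discount=0.0):
    """Monthly spend for a workload at a given $/M-token rate and batch discount."""
    return requests * tokens_per_request * price_per_m * (1 - discount) / 1_000_000

# Hypothetical enrichment pipeline: 2M requests/month, ~1,500 tokens each.
realtime = monthly_cost(2_000_000, 1_500, price_per_m=2.00)
batch    = monthly_cost(2_000_000, 1_500, price_per_m=2.00, discount=0.50)
print(f"realtime ${realtime:,.0f}/mo vs batch ${batch:,.0f}/mo")
```

For any pipeline that already runs on a schedule, the only cost of claiming this discount is tolerating the completion window.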
## Where inference actually runs
Running inference at production quality depends on three infrastructure layers that are invisible from the API surface but dominate the cost structure.
The GPU tier. Modern LLMs require GPUs with high memory capacity and bandwidth. NVIDIA H100 and H200 cards (80-141GB of HBM memory) are the current workhorses for large models; B200 is beginning to displace them for new deployments. Smaller models can run on A100 or L40S cards, and very small models on consumer cards. The choice of card affects both throughput and the minimum model size that can be served without sharding across multiple cards.
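A rough sizing rule makes the sharding threshold concrete: weights alone need about 2 bytes per parameter at FP16/BF16 (half that at 8-bit quantization), before KV cache and activations. The overhead factor below is an assumed allowance for those runtime buffers, not a measured figure.

```python
import math

def min_gpus(params_b, bytes_per_param=2, gpu_mem_gb=80, overhead=1.2):
    """Minimum 80GB-class GPUs to hold a model's weights plus assumed runtime overhead.

    params_b: parameter count in billions. overhead: crude multiplier for
    KV cache and activations (illustrative, not a benchmark).
    """
    needed_gb = params_b * bytes_per_param * overhead
    return math.ceil(needed_gb / gpu_mem_gb)

for size in (8, 70, 405):
    print(f"{size}B params -> {min_gpus(size)}x 80GB GPUs (FP16, rough)")
```

The discontinuity matters: an 8B model fits on one card, a 70B model forces multi-GPU sharding, and a 405B model forces multi-node serving, each step adding interconnect cost and operational complexity.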
The serving engine. Raw GPU access is not sufficient. Running a model efficiently requires a serving engine that manages the KV cache, batches concurrent requests, and optimizes memory layout. The engines most commonly seen in production are vLLM (open-source, general purpose), TensorRT-LLM (NVIDIA-specific, high performance), and SGLang (open-source, optimized for structured generation). The serving engine alone can change effective throughput by 3-5x on the same hardware.
Regional deployment. Inference latency is dominated by compute time, but network round-trip adds 30-200ms on top. For user-facing workloads, serving from a region close to the user (or from a provider with multi-region routing) meaningfully affects perceived responsiveness.
Operating this stack in-house is viable for a small number of teams with serious scale and dedicated infrastructure engineering. For everyone else, the infrastructure is purchased, and the question becomes which purchasing model fits the workload.
## How teams access inference in practice
Three patterns cover almost all production inference access today.
Self-hosting on rented GPUs. The team rents GPU capacity (from AWS, Runpod, Lambda, or similar), deploys an open-weight model on a serving engine, and operates the full stack. This gives maximum control and can be the cheapest path at very high volume, but it requires infrastructure expertise most application teams do not want to build: KV cache tuning, autoscaling, failover, observability.
Direct provider APIs. The team calls a single provider's API: OpenAI for GPT, Anthropic for Claude, Together or Fireworks for open-weight models. The provider handles the infrastructure. This is the fastest path to a working product, and it is how most teams start. The tradeoff is concentration risk and lock-in on a single pricing structure. Every request's cost, latency, and reliability become whatever that provider's service delivers that day.
Unified inference platforms. The team calls a single API that routes to multiple underlying providers with fallback, observability, and cost controls built in. This preserves the speed-to-ship advantage of direct provider APIs while removing single-provider concentration risk and allowing workload-aware routing (cheap model for simple queries, capable model for complex ones). Inferbase sits in this category. We operate an OpenAI-compatible API that spans 200+ models across multiple underlying providers, with routing, fallback, and cost transparency as first-class concerns.
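The routing idea behind a unified platform reduces to a small core: try providers in priority order and fall through on failure. The provider callables below are stubs standing in for real API clients; the interface is hypothetical.

```python
def route(prompt, providers):
    """Try (name, callable) pairs in order; return the first success."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:        # timeout, rate limit, 5xx, ...
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(prompt):
    # Stand-in for a provider that is currently down.
    raise TimeoutError("upstream timeout")

def healthy(prompt):
    # Stand-in for a working provider.
    return f"echo: {prompt}"

name, reply = route("hello", [("primary", flaky), ("fallback", healthy)])
print(name, reply)  # fallback echo: hello
```

Production routers layer workload-aware selection, latency tracking, and cost accounting on top of this loop, but failover between independent providers is the mechanism that removes single-vendor concentration risk.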
| Access pattern | Speed to ship | Operational burden | Economical at | Concentration risk |
|---|---|---|---|---|
| Self-hosted on rented GPUs | Slow | High (full infra stack) | Very high volume | None |
| Direct provider API | Fast | Low | Low to medium volume | High (single vendor) |
| Unified inference platform | Fast | Low | Any volume, mixed workloads | Low (multi-provider) |
The choice among these three is driven by scale, engineering bandwidth, and how much variance in cost and reliability across providers the team is willing to absorb directly.
## Why inference is the ongoing cost center
Inference dominates the economics of AI-powered products because it is the only cost that scales with usage. A model's training cost is fixed. The provider (or the team, if self-training) pays it once. Every inference request, by contrast, consumes new compute. Ten users become ten inferences; one million users become one million inferences. The unit economics of an AI product are, in practice, unit economics of inference.
This structure has two consequences that tend to catch teams off guard. First, the cost of an AI feature grows directly with its success: a feature with poor adoption has a small inference bill, and a feature with strong adoption has a proportionally larger one. Second, every architectural choice that changes inference cost compounds with usage, whether it is model selection, prompt design, caching strategy, provider choice, or retry policy. A 20% reduction in per-request cost at the prototype stage becomes a 20% reduction in marginal cost forever.
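The compounding claim is plain arithmetic. Marginal cost scales linearly with requests, so a 20% per-request reduction is 20% at every scale. The per-request cost and volumes below are illustrative.

```python
def monthly_bill(requests, cost_per_request):
    """Marginal inference spend: linear in request volume."""
    return requests * cost_per_request

base = 0.002             # $ per request (illustrative)
optimized = base * 0.8   # 20% cheaper via model choice, caching, prompt design

for label, reqs in [("10k users", 1_000_000), ("1M users", 100_000_000)]:
    saved = monthly_bill(reqs, base) - monthly_bill(reqs, optimized)
    print(f"{label}: ${saved:,.0f}/mo saved")
```

The same optimization that saves a few hundred dollars at prototype scale saves tens of thousands per month once the product finds traction, which is why per-request cost deserves attention before launch, not after.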
Our post on the hidden costs of LLM APIs covers the specific factors that cause real inference bills to diverge from published list rates. The short version is that the headline $/M token figure describes a best-case single request, and production workloads are rarely in that case.
## What this unlocks
Inference is the layer where every abstract decision about AI (which model, which provider, which architecture) collapses into a concrete per-request cost, latency, and reliability profile. Teams that treat inference as an operational concern from the start make architectural choices that scale; teams that treat it as a downstream implementation detail tend to rebuild their infrastructure once their product finds traction.
The practical step beyond understanding the term is running real workloads against real models. Our playground provides side-by-side model comparison without a deployment commitment, and the unified inference API handles the routing, fallback, and observability that production inference workloads require. For teams still choosing a model, our guide to choosing the right AI model walks through the workload-to-model mapping we use internally.