The competitive pressure on context window length has produced some genuinely strange numbers over the past two years. A model that shipped in 2022 with an eight-thousand-token window now feels archaic against frontier offerings that advertise one million, two million, and in the Llama 4 Scout release, ten million tokens of context. The headline numbers have grown an order of magnitude every twelve to eighteen months, the marketing slides have grown taller alongside them, and the public assumption has quietly settled into a comfortable belief that long-context models have solved the document-stuffing problem and made retrieval architectures optional.
The practical reality is less tidy. A context window is, in plain terms, the maximum number of tokens a language model can attend to within a single request, counting both the input prompt and the output it generates. The number printed on the model card describes the theoretical ceiling, the point at which the model will reject further input. It does not describe the length at which the model retrieves accurately, reasons reliably, or stays within a usable cost and latency envelope. This post (built on the foundational guide to AI inference, the explanation of retrieval-augmented generation, and the analysis of KV cache memory) walks through what a context window is, why the advertised length and the usable length diverge, and how to size a context budget for production workloads.
What a context window actually is
A context window is the working memory of a language model at inference time. When a model receives a prompt, it tokenizes the input, places each token in a position, and computes an attention map across the entire sequence. Every token can attend to every prior token, and the model uses that map to predict the next token. The context window is the upper bound on how long that sequence can be. A model with a 200,000-token context window can hold roughly 150,000 words of input and output in a single request, the length of a long novel. A model with a 2,000,000-token window can in principle hold the entirety of a mid-sized technical manual, several quarters of legal filings, or a software project of moderate size.
The mechanism is a property of the transformer architecture rather than of any particular model. A transformer's self-attention layer computes a similarity score between every pair of tokens in the sequence, and that pairwise comparison is what gives the model its ability to relate one part of the input to another. The context window is, structurally, the number of token positions for which the model has been trained and configured to compute those comparisons. Tokens outside the window are simply not part of the model's working state on that request.
Two clarifications matter from the start. First, the context window covers both the input prompt and the output generation in a single accounting. A request with 195,000 input tokens against a 200,000-token model has 5,000 tokens of output budget left before the model has to truncate or stop. Second, the context window is per-request: the model has no memory across requests by default, and any conversation history that the application wants the model to consider has to be re-sent inside the context budget of every new request. The illusion of long-running memory in chat products is implemented by sending more context per request, or by retrieving from an external memory and inserting it back into the prompt.
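Because history has to be re-sent on every request, an application needs some rule for what to drop once the budget is full. The sketch below is one minimal way to do that, assuming a keep-the-newest policy; `fit_history` and `rough_token_count` are hypothetical helper names, and the four-characters-per-token counter is only a stand-in for the model's real tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def fit_history(
    turns: List[str],
    window: int,
    reserved_output: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Return the most recent turns that fit in the window after reserving output."""
    budget = window - reserved_output
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for turn in reversed(turns):
        cost = count(turn)
        if used + cost > budget:
            break  # everything older than this turn is dropped
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # 10 rough tokens each
print(fit_history(history, window=35, reserved_output=10))  # keeps the last two turns
```

Dropping the oldest turns is the simplest policy, not necessarily the best one; summarizing older turns or moving them to an external memory are the alternatives the paragraph above alludes to.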
How context is measured
A token is not the same thing as a word, and different model families use different tokenizers on top of that, so the same string of text produces different counts on different models. OpenAI's GPT family uses byte-pair encoding with a vocabulary of around 100,000 to 200,000 tokens, Anthropic's Claude family uses a separate tokenizer with similar but not identical behavior, and Google's Gemini, Meta's Llama, and Alibaba's Qwen each maintain their own. As a useful first approximation, a token is around three-quarters of an English word, or four characters of English text. A 200,000-token Claude window holds roughly 150,000 English words. The same window filled with Japanese, Chinese, or Korean text holds the same number of tokens but often substantially fewer characters, because the tokenizer's UTF-8 fallback can expand a single multi-byte character into several tokens.
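The rule of thumb above can be written down directly. This is a sketch for rough planning only, using the two ratios from the paragraph (four characters or three-quarters of a word per token); the function names are illustrative, and the ratios hold only loosely for English prose.

```python
def estimate_tokens_from_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate tokens from character count (English prose heuristic)."""
    return round(len(text) / chars_per_token)


def estimate_tokens_from_words(text: str, words_per_token: float = 0.75) -> int:
    """Estimate tokens from whitespace-separated word count."""
    return round(len(text.split()) / words_per_token)


def words_that_fit(window_tokens: int, words_per_token: float = 0.75) -> int:
    """Approximate number of English words a window of this size can hold."""
    return int(window_tokens * words_per_token)


print(words_that_fit(200_000))  # 150000
```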
Token count also depends on the content type. Code can tokenize more compactly than English prose where the tokenizer maps common identifiers and keywords to single tokens, JSON and XML tend to tokenize heavier because of their quotes, brackets, and repeated structural markup, and mixed-language documents with rare characters tokenize heavier still. Production accounting always uses the model's actual tokenizer rather than a character or word estimate. OpenAI publishes the tiktoken library, Anthropic publishes a token-counting endpoint, and the open-weight families ship their tokenizers alongside the weights.
The headline context-window number always counts input tokens plus output tokens together, but many providers also publish a separate cap on the output portion. Claude 4 models allow up to 32,000 tokens of output inside a 200,000-token total window; GPT-5.4 publishes a similar 16,000-token output cap inside its 400,000-token window. The lower output cap is partly a quality choice, since long generations tend to degrade as the model loses coherence, and partly a billing choice, since output tokens cost four to six times more than input tokens and the per-request liability of an unbounded output is meaningful. For applications that need long generation, the output cap is the binding constraint, and the input portion of the context window is mostly slack.
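The interaction between the shared window and the separate output cap reduces to a single `min`. The sketch below uses the 200,000-token window and 32,000-token cap quoted above as example inputs; `max_output_tokens` is an illustrative helper, not a provider API.

```python
def max_output_tokens(window: int, input_tokens: int, output_cap: int) -> int:
    """Tokens the model can still generate: limited by both the window and the cap."""
    remaining = window - input_tokens
    if remaining <= 0:
        return 0  # the prompt alone already fills (or overflows) the window
    return min(remaining, output_cap)


print(max_output_tokens(200_000, 50_000, 32_000))   # 32000: the output cap binds
print(max_output_tokens(200_000, 195_000, 32_000))  # 5000: the window binds
```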
Where the context limit comes from
Three structural mechanisms set the context window of a transformer.
The first is attention compute. Self-attention computes a similarity score between every pair of tokens in the input. For an input of N tokens, the attention matrix has N² entries. Doubling the context therefore quadruples both the memory required to hold the full attention map and the compute required to fill it. The quadratic scaling is the reason context windows did not casually grow during the early years of the transformer era. Various optimizations (flash attention, sparse attention, sliding-window attention) have softened the constant but not changed the fundamental shape of the scaling, and the curve is still what makes a long-context forward pass meaningfully slower and more expensive than a short one.
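The N² relationship is easy to check numerically. The sketch below counts pairwise scores for a single attention head in a single layer; the function name is illustrative, and real kernels such as flash attention avoid materializing the whole matrix at once even though the amount of work stays quadratic.

```python
def attention_entries(n_tokens: int) -> int:
    """Number of pairwise scores in a full N x N attention matrix (one head, one layer)."""
    return n_tokens * n_tokens


short, doubled = 8_000, 16_000
print(attention_entries(short))                               # 64000000
print(attention_entries(doubled) // attention_entries(short)) # 4: doubling N quadruples the scores
```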
The second is positional encoding. A transformer has no built-in notion of token order, so the model has to be told where each token sits in the sequence. Positional encodings inject that information, and the schemes commonly used in 2026 (rotary embeddings, ALiBi, and their variants) are trained over a specific range of positions. A model trained on positions zero through eight thousand has no calibrated behavior for position 100,000. Extending the trained range either requires fresh training (expensive) or a clever interpolation scheme such as RoPE scaling, position interpolation, or YaRN, which compresses the existing positional encoding into a wider effective range. The interpolation schemes have made long-context training tractable, but every extension carries some quality cost relative to the model trained at the same length from scratch.
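To make the interpolation idea concrete, here is a minimal sketch of linear position interpolation on rotary angles. It assumes the common rotary formulation (angle = position × base^(-2i/d) for each dimension pair i) and rescales positions by trained length over target length; it does not reproduce YaRN or any specific model's scaling recipe, and the function names are illustrative.

```python
from typing import List


def rope_angles(position: float, head_dim: int, base: float = 10_000.0) -> List[float]:
    """Rotation angle for each dimension pair at one position."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def interpolated_angles(
    position: float,
    head_dim: int,
    trained_len: int,
    target_len: int,
    base: float = 10_000.0,
) -> List[float]:
    """Linear position interpolation: squeeze target positions into the trained range."""
    scale = trained_len / target_len
    return rope_angles(position * scale, head_dim, base)


# Position 100,000 in a stretched window lands on the angles the model
# originally saw at position 8,000, the edge of its trained range.
stretched = interpolated_angles(100_000, head_dim=64, trained_len=8_000, target_len=100_000)
original = rope_angles(8_000, head_dim=64)
print(all(abs(a - b) < 1e-6 for a, b in zip(stretched, original)))
```

The quality cost mentioned above shows up here as resolution: neighboring positions in the stretched window end up closer together in angle than anything the model saw during training.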
The third is training distribution. A model's effective context length is bounded by the document lengths it saw during training. Even with a positional encoding that nominally supports 1,000,000 tokens, a model trained predominantly on documents under 10,000 tokens will retrieve and reason poorly across the longer range. The current frontier-class long-context training pipelines (notably the techniques used in Gemini 1.5 and the 2.x line and in the Llama 4 family) curate long-document corpora and synthetic long-document tasks specifically to give the model real attention over its full window. The quality of that training is what separates an advertised long-context model that actually retrieves accurately from one that simply does not reject the input.
The cost of context
A long context is not free. Every additional token in the prompt costs time, memory, and money, in ways that compound differently across each dimension.
The memory cost is the KV cache. Every token a model processes leaves behind a key and value vector at every layer, and those vectors persist for the duration of the request because subsequent tokens have to attend to them. A 200,000-token request with a 70B-class dense model carries a KV cache on the order of tens of gigabytes per request. The cache scales linearly with context length and with batch size, and it dominates the per-request memory footprint on long-context inference. The analysis of KV cache memory walks through the specific formula and the modern compressions (paged attention, FP8 KV cache, grouped-query attention) that have reduced the constant. The shape of the scaling has not changed.
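A common back-of-envelope form of the KV cache size is two vectors (key and value) per token, per layer, per KV head. The sketch below uses that form; the layer count, KV-head count, and head dimension are assumptions chosen to resemble a 70B-class dense model with grouped-query attention, not figures from any specific model card, and the formula is not necessarily the one used in the KV cache analysis referenced above.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 for FP16/BF16, 1 for FP8
    batch: int = 1,
) -> int:
    """Key plus value storage for every token at every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * batch


# Assumed 70B-class shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
fp16 = kv_cache_bytes(200_000, layers=80, kv_heads=8, head_dim=128)
fp8 = kv_cache_bytes(200_000, layers=80, kv_heads=8, head_dim=128, bytes_per_value=1)
print(fp16 / 1e9)  # 65.536 GB at 16-bit precision
print(fp8 / 1e9)   # 32.768 GB with an 8-bit cache
```

Under these assumptions the result lands in the tens of gigabytes per request, and the linear dependence on both `tokens` and `batch` is visible directly in the product.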
The compute cost is dominated by prefill. The first decode step has to attend to every input token, which means the prefill phase processes the entire prompt in a single pass. Prefill compute scales linearly with context length up to a flash-attention crossover, and then quadratically once attention dominates. Time-to-first-token (TTFT) follows the same shape. A 1,000-token prompt typically produces a sub-second TTFT on a frontier serving stack, and a 100,000-token prompt against the same model produces a TTFT in the range of three to ten seconds, depending on the provider's batching and prefill optimization. Long-context requests are not interactive-grade out of the box.
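The linear-then-quadratic shape can be illustrated with a rough FLOP count. The sketch below assumes a widely used approximation for dense transformers: about 2 FLOPs per parameter per token for the weight matrix multiplies, plus about 2 × layers × d_model × N FLOPs per token for attention over N positions. The model shape is an assumed 70B-class configuration, and the result says nothing about wall-clock time on a particular serving stack.

```python
from typing import Tuple


def prefill_flops(n_tokens: int, n_params: float, n_layers: int, d_model: int) -> Tuple[float, float]:
    """Return (dense, attention) FLOPs for prefilling n_tokens. Rough approximation."""
    dense = 2 * n_params * n_tokens                     # linear in N
    attention = 2 * n_layers * d_model * n_tokens ** 2  # quadratic in N
    return dense, attention


def crossover_tokens(n_params: float, n_layers: int, d_model: int) -> float:
    """Prompt length at which the quadratic term equals the linear term."""
    return n_params / (n_layers * d_model)


# Assumed shape: 70e9 parameters, 80 layers, d_model 8192.
for n in (1_000, 100_000, 200_000):
    dense, attention = prefill_flops(n, 70e9, 80, 8192)
    print(n, f"attention share: {attention / (dense + attention):.0%}")

print(round(crossover_tokens(70e9, 80, 8192)))  # roughly 107,000 tokens under these assumptions
```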
The dollar cost is set by per-token billing. A request with a 100,000-token system prompt and a 200-token user turn against Claude Sonnet 4.6 costs roughly $0.30 in input tokens alone, before any output is generated. The same request issued a hundred times a day accumulates to $900 per month of input billing for that single application. Frontier providers have responded to this with cache discounts: a system prompt sent repeatedly across requests can be cached at the provider's edge, and the cached portion is billed at roughly one-tenth of the base input rate. The enterprise pricing audit and the framework on hidden API costs cover the caching mechanics in detail. The short version is that cache-friendly prompt structure becomes the default once context budgets cross a few tens of thousands of tokens.
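The arithmetic behind those figures is worth making explicit. The sketch below assumes a base input price of $3.00 per million tokens (the rate implied by the $0.30 figure above, not a quoted price list) and a cached-read rate of one-tenth of base; it ignores any surcharge a provider may apply when the cache is first written.

```python
def input_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Input billing for one request at a flat per-million-token price."""
    return tokens * price_per_mtok / 1_000_000


def cached_input_cost_usd(
    cached_tokens: int,
    fresh_tokens: int,
    price_per_mtok: float,
    cached_multiplier: float = 0.1,  # assumed: cached reads at one-tenth of base
) -> float:
    """Input billing when a stable prefix is served from the prompt cache."""
    cached = input_cost_usd(cached_tokens, price_per_mtok) * cached_multiplier
    return cached + input_cost_usd(fresh_tokens, price_per_mtok)


PRICE = 3.00  # assumed USD per million input tokens

per_request = input_cost_usd(100_000 + 200, PRICE)  # about $0.30
monthly = per_request * 100 * 30                    # about $900 at 100 requests a day
with_cache = cached_input_cost_usd(100_000, 200, PRICE)  # about $0.03

print(f"{per_request:.2f} {monthly:.0f} {with_cache:.2f}")
```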
The three costs interact. A long-context workload running at high queries per second hits the memory ceiling of the GPU before it hits the published throughput ceiling of the model. A long-context workload running interactively hits the TTFT ceiling of acceptable user experience before it hits the cost ceiling. A long-context workload running in batch hits the dollar ceiling before either. The right context budget for a given application is set by whichever ceiling binds first, not by what the model nominally supports.
Published length versus usable length
The published context window names the input length the model will accept. The usable length, where retrieval and reasoning still hold up under load, is consistently shorter.
The most-cited evidence on this gap comes from the Lost in the Middle study (Liu et al., 2023), which evaluated retrieval accuracy as a function of the position of the relevant passage inside a long context. The finding, robust across model families and replicated many times since, is that retrieval accuracy is U-shaped over the position axis. Models retrieve well from the start of the context (a primacy bias), retrieve well from the end of the context (a recency bias, since those tokens sit closest to generation), and degrade meaningfully in the middle. The shape has softened on frontier 2026 models but has not disappeared. A document buried in the middle of a 100,000-token prompt is materially less likely to be cited correctly than the same document placed near the start or end.
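One practical consequence of the U-shape is that passage order inside the prompt matters. The sketch below shows one simple ordering heuristic, assuming the passages arrive ranked most-relevant-first: alternate them between the front and the back so the weakest candidates end up in the middle. The function name is illustrative, and whether the reordering helps on a given model is an empirical question.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: List[T]) -> List[T]:
    """Place the strongest items at the start and end, the weakest in the middle."""
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        # Even ranks go to the front, odd ranks to the back.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


print(reorder_for_edges([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```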
The other widely used long-context benchmark is RULER, NVIDIA's needle-in-a-haystack-plus extension, which tests multi-hop retrieval, aggregation, and tracking across long contexts. RULER results are typically lower than the headline NIAH (needle in a haystack) numbers that providers advertise, because the underlying tasks are harder than spotting a planted sentence in irrelevant text. Public RULER scores for frontier 2026 models commonly hold near-perfect retrieval through roughly 32,000 tokens, degrade gradually through 128,000, and drop sharply above 256,000. The advertised 1M and 2M windows are technically accessible (the model will produce an answer) but the empirical accuracy across the full window varies more than the marketing implies.
The third dimension is reasoning rather than retrieval. Even when a model retrieves the right passage from a long context, the quality of multi-step reasoning over a large context is consistently lower than reasoning over a focused context with the same content. The pattern shows up cleanly on long-document question answering where the correct answer requires combining information from multiple sections. A frontier model handed the exact relevant passages inside a 4,000-token context routinely outperforms the same model handed the entire document inside its full window. Long context preserves recall; it does not preserve focused reasoning to the same degree.
The practical takeaway is that the headline context length is the boundary at which the model will reject input. The usable length is the boundary at which retrieval and reasoning still meet the quality bar of the application. The two numbers usually differ by a factor of two to ten.
The shape of degradation
Degradation across the context length is not a single curve. Different aspects of quality decay at different rates, and an application's effective length is set by whichever curve binds first.
Retrieval (finding a specific fact in the context) holds up well into the tens of thousands of tokens and degrades gradually beyond that on frontier models. Provider-published NIAH scores are dominated by recall at this level and are typically the strongest of the long-context metrics.
Multi-hop reasoning (combining facts from different parts of the context) degrades faster than retrieval, particularly when the relevant facts sit far apart in the prompt. The degradation often shows up as confident but wrong synthesis: the model retrieves both facts but combines them incorrectly.
Instruction adherence decays in a third way. A long context with a complex system prompt at the start tends to see the model drift away from the system instructions as the prompt grows. The drift is particularly visible in production agent workloads, where the system prompt encodes tool-use rules and safety constraints, and the model gradually relaxes those constraints as the conversation history accumulates.
Generation quality declines on long-context tasks even when the input fits comfortably. The reasoning chain produced by the model on a 100,000-token input is typically shorter and less precise than the chain on a focused 4,000-token input with the same essential content. The model has less working capacity available for the actual reasoning step, and the effect compounds with the multi-hop weakness above.
The composite effect is that the practical effective length of a model varies by task. For a question-answering system that needs precise retrieval, the binding curve is retrieval and the effective length is generous. For a multi-step agent that has to coordinate tool calls and reasoning across a long history, the binding curve is instruction adherence and the effective length is much shorter, often a fraction of the advertised window.
Long context, retrieval, and caching
The three architectural responses to bounded context are not interchangeable. They solve different problems and have different operating envelopes.
Long context is the right answer when the application needs the model to attend to a coherent body of source material in a single pass, and when the source material fits comfortably inside the usable (not just the advertised) length. Single-document analysis, fresh-codebase ingestion, and detailed contract review at modest sizes all fit this shape. The cost is the per-request bill, which scales linearly with prompt size, and the latency, which scales worse.
Retrieval-augmented generation (RAG) is the right answer when the corpus is larger than the usable window, when the corpus changes faster than the model can be retrained, or when the application has to cite specific source passages for verification. RAG keeps the working context small and lets the system scale corpus size independently of context length. The foundational RAG guide covers the architecture in detail. The trade-off is engineering complexity (an indexing pipeline, an embedding model, a vector store, often a reranker) and retrieval quality (the model can only see what the retriever surfaces).
Prompt caching is the right answer when a large prompt prefix is reused across many requests. A long system prompt or a static document at the start of every request can be cached at the provider's edge, billed at roughly one-tenth of the base input rate, and (for Claude 4.x models) exempted from the input-tokens-per-minute rate limit ceiling. Caching does not solve the underlying long-context problem (the model is still asked to attend to the full context) but it does lower the marginal cost of doing so when the prefix is stable.
The architectural decision among the three is often misframed as "long context replaces RAG" or "caching replaces long context", neither of which is accurate. In production, stacks typically combine all three: a stable system prompt is cached, dynamic source material is retrieved into a moderate-length context window, and the model operates inside a usable rather than advertised length budget. A rough decision sketch: if the corpus is under 30,000 tokens and is stable across requests, long context with caching wins. If the corpus is under 30,000 tokens and changes per request, plain long context is fine until cost forces a re-evaluation. If the corpus exceeds 30,000 tokens or changes per session, RAG is almost always the right architecture, and the model's context window only needs to be long enough to hold the retrieved passages plus the conversation state.
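The rough decision sketch above can be written as a small function. This is a direct transcription of the three rules with the 30,000-token threshold as a default parameter; the `Churn` enum and the returned labels are illustrative names, and a corpus of exactly 30,000 tokens is treated as "under" by assumption.

```python
from enum import Enum


class Churn(Enum):
    STABLE = "stable across requests"
    PER_REQUEST = "changes per request"
    PER_SESSION = "changes per session"


def choose_architecture(corpus_tokens: int, churn: Churn, threshold: int = 30_000) -> str:
    """Map corpus size and change pattern to the architecture the text suggests."""
    if corpus_tokens > threshold or churn is Churn.PER_SESSION:
        return "RAG"
    if churn is Churn.STABLE:
        return "long context + prompt caching"
    return "plain long context"


print(choose_architecture(20_000, Churn.STABLE))       # long context + prompt caching
print(choose_architecture(20_000, Churn.PER_REQUEST))  # plain long context
print(choose_architecture(500_000, Churn.STABLE))      # RAG
```

In a combined stack the output is better read as "which component carries the corpus" than as an exclusive choice, since a cached system prompt can sit alongside retrieved passages.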
Sizing context budgets in production
The most common production mistake on context-window sizing is to design the prompt against the advertised length and to assume the model will hold up. The more reliable pattern is to set the prompt budget against the usable length, leave headroom for the output, and treat anything beyond that as a candidate for retrieval rather than for stuffing.
A working production budget on a 200,000-token frontier model typically looks something like:
- 4,000 to 16,000 tokens of system prompt (with prefix caching enabled)
- 8,000 to 32,000 tokens of dynamic context (retrieved passages, recent conversation, tool-call results)
- 4,000 to 16,000 tokens of output budget reserved
- The remaining nominal capacity treated as slack
The same model nominally supports 200,000 tokens, but the budgets above add up to roughly 16,000 to 64,000 tokens of effective context per request, which is where retrieval accuracy and reasoning quality are still solid and where the per-request cost stays predictable. Pushing into the 100,000-to-200,000 range happens for specific use cases (single-document analysis, codebase queries, long-form summarization) and is treated as the exception rather than the default.
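A budget like this is easy to encode and check at request time. The sketch below takes the upper end of each range from the list as example values; `ContextBudget` and its methods are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    system_prompt: int
    dynamic_context: int
    reserved_output: int

    @property
    def total(self) -> int:
        """Tokens the budget plans to use per request."""
        return self.system_prompt + self.dynamic_context + self.reserved_output

    def slack(self, window: int) -> int:
        """Nominal capacity left unused on purpose."""
        return window - self.total

    def dynamic_overflow(self, dynamic_tokens: int) -> int:
        """Tokens of dynamic material that exceed the budget and should go to retrieval."""
        return max(0, dynamic_tokens - self.dynamic_context)


budget = ContextBudget(system_prompt=16_000, dynamic_context=32_000, reserved_output=16_000)
print(budget.total)                     # 64000
print(budget.slack(200_000))            # 136000
print(budget.dynamic_overflow(45_000))  # 13000
```

Treating the overflow as a signal to retrieve rather than to stuff keeps the request inside the range where quality and cost stay predictable.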
The same logic carries over to 1M and 2M models. The nominal extension does not move the usable length by the same factor; on most production tasks the usable budget on a 2M model is in the same 100,000-to-300,000 range as on the previous-generation 200,000-token models, with the extra nominal length useful primarily as headroom for tasks that genuinely require it.
A second sizing rule worth internalizing: prompt-cache hit rate is more leveraged than raw context length in most production workloads. A 16,000-token static system prompt with a 95% cache hit rate is cheaper per request than a 4,000-token dynamic prompt with no caching, and the cached prefix usually does not consume input-tokens-per-minute rate-limit budget. Designing the prompt around cacheability often produces a better cost and latency profile than designing it around minimum size.
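The comparison in that rule can be checked with an expected-cost calculation. The sketch below expresses cost in full-price input-token equivalents and assumes cached reads are billed at one-tenth of base; the `miss_multiplier` parameter is a hypothetical knob for providers that charge a premium when a miss rewrites the cache, and the 1.25 used in the second example is an assumed value, not a quoted rate.

```python
def effective_input_tokens(
    prompt_tokens: int,
    hit_rate: float,
    cached_multiplier: float = 0.1,  # assumed: cached reads at one-tenth of base
    miss_multiplier: float = 1.0,    # raise above 1.0 to model a cache-write premium
) -> float:
    """Expected billed tokens per request, in full-price input-token equivalents."""
    per_token = hit_rate * cached_multiplier + (1 - hit_rate) * miss_multiplier
    return prompt_tokens * per_token


cached_large = effective_input_tokens(16_000, hit_rate=0.95)
uncached_small = effective_input_tokens(4_000, hit_rate=0.0)
with_premium = effective_input_tokens(16_000, hit_rate=0.95, miss_multiplier=1.25)

print(round(cached_large), round(uncached_small), round(with_premium))  # 2320 4000 2520
```

Under either assumption the larger cached prompt bills as fewer effective tokens than the smaller uncached one, which is the leverage the rule describes.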
Where the field is heading
Three threads are visible in 2026.
The first is continued scaling of nominal context length, driven by the competitive pressure that has produced the 1M, 2M, and 10M numbers. The scaling is real in the sense that the models will accept longer input. The gap between accepted length and usable length is shrinking, but it is still material on most production tasks. Headlines are likely to keep moving faster than the underlying capability.
The second is architectural variants that change the cost shape of long context. State-space models, hybrid transformer-SSM stacks, and various forms of sparse and sliding-window attention all reduce the quadratic compute cost at long context. The current production frontier is still dominated by dense transformers because the quality at the frontier scale has not yet been matched by the alternative architectures, but the gap is closing and the cost shape will change once it does.
The third is caching and persistence as a first-class API surface. Anthropic, OpenAI, Google, and DeepSeek have all moved prompt caching from an internal optimization to an explicit billing tier over the past eighteen months, and the next step (persistent conversation state cached across requests for hours or days) is already in preview at multiple providers. Persistent caching, if it lands at frontier providers with reasonable cost discounts, would substantially reduce the practical pressure on raw context length, because much of what currently sits inside a long context could move into a cached persistent state outside the per-request token budget.
The underlying point through all of this is that the context window is not a single number, and the production decision is not "how long a window do I need". The window length sets the upper bound. The usable length is where the application actually operates. The cost, latency, and quality envelopes are set by the combination of window, content distribution, and surrounding caching and retrieval architecture. Treating the headline number as the operating budget is the most common way teams over-provision and under-deliver on long-context production workloads.