What is a token in a language model?

A token is the basic unit of text that a language model reads and writes. It is usually a subword: a common word like 'the' is often a single token, while a longer or rarer word is split into several pieces, such as 'tokenization' becoming 'token' and 'ization'. Tokens are not the same as words or characters. On average, English text runs around four characters or three-quarters of a word per token, so a rough rule is that one hundred tokens cover roughly seventy-five words. The model never sees raw letters; it sees a sequence of token identifiers, each an integer that indexes into a fixed vocabulary the tokenizer learned in advance.

Why do language models use tokens instead of words or characters?

Subword tokens are a compromise between two extremes that both fail. A word-level vocabulary cannot represent words it never saw during training, so every typo, new term, or rare name becomes an unknown placeholder. A character-level vocabulary handles any input but makes sequences very long and forces the model to relearn spelling from scratch, which is slow and wasteful. Subword tokenization keeps frequent words whole for efficiency and breaks rare words into reusable pieces so nothing is ever truly out of vocabulary. The result is a manageable vocabulary, typically tens of thousands to a few hundred thousand tokens, that covers any text while keeping sequences short enough to process affordably.

How are tokens counted, and why does it matter for cost?

A tokenizer converts your text into a list of token identifiers, and the length of that list is the token count. It matters because nearly every hosted model is priced per token, with separate rates for the tokens you send (input) and the tokens the model generates (output). Two prompts of the same character length can cost different amounts if one uses rarer words or a language that the tokenizer splits more aggressively. Token count also determines how much of the context window a request consumes, so understanding it is the first step in controlling both the bill and whether a long input even fits.

Why does the same sentence cost more in some languages than in English?

Most widely used tokenizers were trained on text that is predominantly English, so English words tend to map to few tokens while text in other languages is split into many more pieces. The same sentence in a language like Hindi, Thai, or Burmese can require several times as many tokens as its English equivalent, even though it conveys identical meaning. Because pricing and context limits are measured in tokens, this means non-English usage can cost more and exhaust the context window faster. Newer tokenizers with larger, more multilingual vocabularies narrow the gap but rarely eliminate it.

What Is Tokenization? How Language Models Read Text

Does tokenization affect how well a model performs?

Yes, in subtle but real ways. Because the model reasons over tokens rather than characters, tasks that depend on individual letters or digits can be harder than they look. Counting the letters in a word, reversing a string, or doing precise arithmetic can fail partly because the relevant characters are bundled inside larger tokens the model cannot easily decompose. Tokenization also shapes how numbers are split, which influences arithmetic reliability, and how code is segmented, which affects programming tasks. These are not flaws in reasoning so much as consequences of the unit the model is forced to work in.

Most of what people notice about a language model happens at the edges: the prompt typed into a box and the fluent text that comes back. Between those two visible moments sits a step that almost no one thinks about, and yet every property of the model, from its price to the length of input it accepts, is shaped by it. A model does not read letters, and it does not read words. Before any reasoning can happen, the text has to be cut into the units the model was actually trained on. That cutting step is tokenization.

A token is the basic unit of text a model reads and writes, and for modern models it is usually a subword: a fragment that may be a whole common word, a piece of a longer one, or a single punctuation mark. The component that performs the split, the tokenizer, is a separate piece of software with its own fixed vocabulary, decided before training and frozen afterward. This post builds on the guide to AI inference and explains the layer that quietly determines what a model can read, how much it costs, and why two prompts of equal length can carry very different bills.

What a token actually is

The instinct is to assume a model reads words, because words are how people think about language. In practice the unit is smaller and stranger. A tokenizer holds a vocabulary of fixed entries, often between thirty thousand and a few hundred thousand of them, and its job is to express any input as a sequence drawn from that vocabulary. Frequent words usually earn their own entry, so "the" or "model" map to a single token. Less frequent words get assembled from pieces, so "tokenization" might become "token" followed by "ization," and an unusual name might fragment into several parts.
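A minimal sketch of that assembly step, assuming a tiny hypothetical vocabulary and a greedy longest-match rule. Real tokenizers apply learned merge rules or scores instead, so `VOCAB` and `greedy_split` are illustrative names, not any library's API.

```python
# Hypothetical vocabulary, chosen only to mirror the examples in the text.
VOCAB = {"the", "model", "token", "ization"}


def greedy_split(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest vocabulary entries, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(greedy_split("the", VOCAB))           # ['the']
print(greedy_split("tokenization", VOCAB))  # ['token', 'ization']
print(greedy_split("Zyx", VOCAB))           # ['Z', 'y', 'x']
```

The single-character fallback is what lets an unusual name fragment into several parts rather than fail outright.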

What the model receives is not the text itself but a list of integers, one per token, each indexing into that vocabulary. The string "How are you" is not three words to the model; it is a short list of identifiers the tokenizer produced. A useful rough conversion for English is that a token averages about four characters, or roughly three-quarters of a word, so one hundred tokens cover something like seventy-five words. That ratio is only an average, and it shifts with the vocabulary, the language, and the kind of text involved.
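The rule of thumb can be written down as a quick estimator. This is only a sketch of the averages quoted above, assuming English text; the function names are made up for illustration, and a real tokenizer is the only way to get an exact count.

```python
import math

CHARS_PER_TOKEN = 4      # rough English average from the text
WORDS_PER_TOKEN = 0.75   # rough English average from the text


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Estimate how many English words a given number of tokens covers."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_tokens("How are you"))  # 3
print(estimate_words(100))             # 75
```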

Why models settled on subwords

Subword tokenization is the answer to a problem with two tempting but broken alternatives. The first is to give every word its own token. This reads naturally but collapses the moment the model meets a word it never saw in training, and there are always such words: misspellings, new product names, technical jargon, rare inflections. A pure word vocabulary has no way to represent them except as a generic "unknown" placeholder, which throws away information the model needs.

The opposite extreme is to treat every character as a token. This never encounters an unknown input, since any text is just a sequence of characters, but it makes sequences punishingly long and forces the model to rebuild every word from its letters each time. Longer sequences cost more to process and leave less room for actual content. Subword tokenization threads the needle between these failures. Common words stay whole for efficiency, rare words decompose into reusable fragments so nothing is ever fully out of vocabulary, and the total vocabulary stays small enough to be practical. Almost every current model uses some form of this compromise.
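The two failing extremes are easy to see on a single sentence. The sketch below assumes a tiny hypothetical word vocabulary that does not contain "tokenized"; the names are illustrative only.

```python
# Hypothetical word-level vocabulary, for illustration only.
WORD_VOCAB = {"the", "model", "reads", "text"}


def word_level(sentence: str, vocab: set[str]) -> list[str]:
    """One token per word; unseen words collapse to a placeholder."""
    return [word if word in vocab else "<unk>" for word in sentence.split()]


def char_level(sentence: str) -> list[str]:
    """One token per character; nothing is unknown, but sequences are long."""
    return list(sentence)


sentence = "the model reads tokenized text"
print(word_level(sentence, WORD_VOCAB))  # ['the', 'model', 'reads', '<unk>', 'text']
print(len(char_level(sentence)))         # 30
```

The word-level output loses "tokenized" entirely, while the character-level output needs thirty tokens for five words. A subword scheme would land between the two.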

How a tokenizer is built

A tokenizer is not designed by hand; its vocabulary is learned from data, much like the model it serves. The most common recipe is byte pair encoding, an algorithm originally from data compression that was adapted for language. It starts with the smallest possible units, individual characters or bytes, and repeatedly looks for the most frequent adjacent pair, merging that pair into a new single token. Run thousands of times over a large corpus, this process discovers that common letter sequences deserve to be tokens of their own, building up from characters to subwords to whole frequent words. The base units plus the ordered list of merges together define the vocabulary, and the same merges are replayed in order to tokenize new text.
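The merge loop can be sketched in a few lines. This is a toy version that works on a short word list and starts from characters. Production tokenizers add pre-tokenization, byte-level units, and far larger corpora, and `train_bpe` is a hypothetical name.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


print(train_bpe(["ab", "ab", "ab", "cab"], num_merges=5))
# [('a', 'b'), ('c', 'ab')]
```

The loop stops early here because after two merges every word is a single symbol and no adjacent pairs remain.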

Several variants of this idea power the models in wide use. Google's WordPiece merges by a likelihood criterion rather than raw frequency and underlies the BERT family. SentencePiece treats the input as a raw stream including spaces, which lets it work uniformly across languages that do not separate words with whitespace. Modern systems often operate on raw bytes rather than characters, an approach popularized by GPT-2, so that any possible input, in any language or symbol set, can always be encoded without an unknown token. The important consequence is that each model family ships its own tokenizer, and a given string can produce different token counts depending on which one you use.
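The reason a byte-level base never needs an unknown token can be shown with the standard library alone. The sketch below covers only the starting point, before any merges are learned, and the helper names are illustrative.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode UTF-8 byte values back into the original text."""
    return bytes(ids).decode("utf-8")


for sample in ["model", "héllo", "नमस्ते", "🙂"]:
    ids = to_byte_ids(sample)
    # Every id fits a 256-entry base vocabulary, and the text round-trips.
    assert all(0 <= i < 256 for i in ids)
    assert from_byte_ids(ids) == sample

print(to_byte_ids("A"))  # [65]
```

With 256 base entries every input is representable, and the learned merges then only decide how compactly it is represented.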

Tokens, cost, and the bill you actually pay

Tokenization stops being an abstraction the moment you look at pricing. Nearly every hosted model charges per token, and it charges separately for the tokens you send and the tokens it generates, usually billing output at a higher rate than input. Your monthly cost is therefore a direct function of how your text tokenizes, not how long it looks on screen. A prompt padded with rare words, unusual formatting, or repeated boilerplate carries more tokens, and more tokens mean more money, on every single request.
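The arithmetic behind a per-token bill is simple enough to sketch. The rates below are hypothetical placeholders, not any provider's actual prices, and `Pricing` and `request_cost` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical rates in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request when input and output are billed at separate rates."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Placeholder rates: output billed at four times the input rate.
example = Pricing(input_per_million=1.0, output_per_million=4.0)
print(request_cost(1000, 500, example))  # 0.003
```

Multiplying the per-request figure by daily request volume is usually what makes a few hundred trimmed tokens worth the effort.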

This is where token awareness turns into real savings. The same information can often be expressed in fewer tokens through tighter prompts, shorter system messages, and trimmed context, and those reductions compound across high request volumes. Reasoning about tokens is the foundation under most LLM cost optimization work, and it sits close to the larger lesson that the hidden costs of LLM APIs usually live in places the pricing page does not advertise. Counting tokens before you ship a prompt is the cheapest optimization available, because it changes the cost of every call that follows.

Tokens and the context window

The second place tokens govern behavior is the limit on how much a model can consider at once. A model's context window is measured in tokens, not words or pages, and it bounds the combined size of your input and the model's output. When a request approaches that limit, it is the token count that matters, which is why a document that looks comfortably short can still overflow if it tokenizes into more pieces than expected.
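A pre-flight check along these lines can catch an overflow before the request is sent. It is a sketch that assumes the window bounds input plus output, as described above. The function names and the 8,000-token window are illustrative.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the input plus the reserved output fits inside the window."""
    return input_tokens + max_output_tokens <= context_window


def room_for_output(input_tokens: int, context_window: int) -> int:
    """Tokens left for the model's answer after the input is counted."""
    return max(0, context_window - input_tokens)


print(fits_context(7000, 1500, 8000))  # False
print(room_for_output(7000, 8000))     # 1000
```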

This connects cost and capacity into one variable. Every token of input both consumes budget and occupies space in the window, so verbose prompts are doubly expensive: they cost more and they crowd out room the model could use for retrieved context or a longer answer. Teams building retrieval pipelines feel this directly, since the passages fetched by embedding similarity have to fit alongside the question and the response inside the same token budget. Managing the window is, in the end, managing tokens.
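For a retrieval pipeline, the shared budget might be enforced roughly as follows. The sketch assumes passages arrive already ranked by similarity with their token counts known, and it stops at the first passage that does not fit. That stopping rule is one simple policy among several, and `pack_passages` is a hypothetical helper.

```python
def pack_passages(
    passage_token_counts: list[int],
    context_window: int,
    question_tokens: int,
    reserved_output_tokens: int,
) -> list[int]:
    """Return indices of the top-ranked passages that fit the token budget."""
    # Whatever the question and the reserved answer do not use is the budget.
    budget = context_window - question_tokens - reserved_output_tokens
    chosen = []
    used = 0
    for index, count in enumerate(passage_token_counts):
        if used + count > budget:
            break  # keep ranking order: stop at the first passage that overflows
        chosen.append(index)
        used += count
    return chosen


print(pack_passages([300, 400, 500], context_window=1000,
                    question_tokens=100, reserved_output_tokens=200))  # [0, 1]
```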

Where tokenization quietly bites

Because tokenization is invisible, its consequences tend to surface as puzzling model behavior rather than obvious errors. The most economically significant is the language gap. Most popular tokenizers were trained on text that is overwhelmingly English, so English maps to relatively few tokens while many other languages fragment into far more. The same sentence in Hindi, Thai, or Burmese can require several times the tokens of its English equivalent, which means non-English users pay more and exhaust the context window faster for identical meaning. Larger multilingual vocabularies have narrowed this gap but rarely close it.
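Part of the gap is visible before any tokenizer is involved. For byte-level tokenizers the raw material is UTF-8 bytes, and scripts such as Devanagari or Thai take three bytes per character where basic Latin letters take one. The sketch below measures only that byte ratio as a rough proxy. Actual token counts depend on the merges a given tokenizer learned.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in the text."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


print(utf8_bytes_per_char("hello"))   # 1.0
print(utf8_bytes_per_char("नमस्ते"))    # 3.0
print(utf8_bytes_per_char("สวัสดี"))    # 3.0
```

A tokenizer with few learned merges for a script has to spend tokens on those bytes individually, which is where the several-fold difference comes from.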

The other consequences are about reasoning, not cost. Since the model works in tokens, tasks that hinge on individual characters become deceptively hard. Counting the letters in a word, reversing a string, or performing exact arithmetic can fail because the relevant characters are bundled inside larger tokens the model cannot easily pull apart, a class of failure documented in studies of character-level understanding. How a tokenizer splits numbers influences arithmetic reliability, and how it segments code affects programming accuracy. None of these reflect a flaw in the model's reasoning so much as the limits of the unit it is forced to think in.
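The effect of number splitting can be illustrated with two hypothetical chunking schemes. Neither is claimed to match a specific tokenizer. The point is only that the same digits can be grouped in ways that do or do not line up with place value.

```python
def chunk_from_left(digits: str, size: int = 3) -> list[str]:
    """Group digits left to right; chunks ignore place value."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_from_right(digits: str, size: int = 3) -> list[str]:
    """Group digits right to left; chunks align with thousands."""
    first = len(digits) % size
    head = [digits[:first]] if first else []
    return head + [digits[i:i + size] for i in range(first, len(digits), size)]


print(chunk_from_left("1234567"))   # ['123', '456', '7']
print(chunk_from_right("1234567"))  # ['1', '234', '567']
```

In the first grouping, the chunk "123" stands for a different magnitude depending on how many digits follow it, which is the kind of inconsistency that can make exact arithmetic harder to learn.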

What this means in practice

Tokenization is the layer where text becomes something a model can process, and almost every practical property of working with language models traces back to it. Price is denominated in tokens. The context limit is denominated in tokens. The strange edges of model behavior, from the multilingual cost penalty to the trouble with counting letters, follow from the choice to reason over subwords rather than words or characters. Understanding tokens is what turns these from surprises into things you can anticipate and plan around.

The useful habit is to make token counts visible before they cost you. Most providers publish a tokenizer or a counting tool, such as OpenAI's tiktoken, and running representative prompts through one shows exactly how your text decomposes and where the tokens accumulate. From there the optimizations are concrete: trim prompts, watch the window, and weigh how a given model tokenizes your particular workload. The Inferbase model catalog tracks the open model families you can run behind a single endpoint, and the unified inference API lets you serve them through one integration while you measure how each one turns your text into tokens.
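A counting helper might look like the sketch below. It uses tiktoken's `get_encoding` and `encode` when the package is installed and its encoding files can be loaded, and otherwise falls back to the four-characters-per-token estimate. The `cl100k_base` default is an assumption for the example, and other model families need their own tokenizer.

```python
import math


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> tuple[int, bool]:
    """Return (token_count, is_exact); is_exact is False for the fallback estimate."""
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), True
    except Exception:
        # tiktoken is missing or could not load the encoding:
        # fall back to the rough English average of four characters per token.
        return math.ceil(len(text) / 4), False


count, exact = count_tokens("How are you")
print(count, "(exact)" if exact else "(estimate)")
```

Returning the flag alongside the count keeps estimates from being mistaken for real measurements when the numbers feed a cost report.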

