Most of what people notice about a language model happens at the edges: the prompt typed into a box and the fluent text that comes back. Between those two visible moments sits a step that almost no one thinks about, and yet every property of the model, from its price to the length of input it accepts, is shaped by it. A model does not read letters, and it does not read words. Before any reasoning can happen, the text has to be cut into the units the model was actually trained on. That cutting step is tokenization.
A token is the basic unit of text a model reads and writes, and for modern models it is usually a subword: a fragment that may be a whole common word, a piece of a longer one, or a single punctuation mark. The component that performs the split, the tokenizer, is a separate piece of software with its own fixed vocabulary, decided before training and frozen afterward. This post builds on the guide to AI inference and explains the layer that quietly determines what a model can read, how much it costs, and why two prompts of equal length can carry very different bills.
What a token actually is
The instinct is to assume a model reads words, because words are how people think about language. In practice the unit is smaller and stranger. A tokenizer holds a vocabulary of fixed entries, often between thirty thousand and a few hundred thousand of them, and its job is to express any input as a sequence drawn from that vocabulary. Frequent words usually earn their own entry, so "the" and "model" each map to a single token. Less frequent words get assembled from pieces, so "tokenization" might become "token" followed by "ization," and an unusual name might fragment into several parts.
What the model receives is not the text itself but a list of integers, one per token, each indexing into that vocabulary. The string "How are you" is not three words to the model; it is a short list of identifiers the tokenizer produced. A useful rough conversion for English is that a token averages about four characters, or roughly three-quarters of a word, so one hundred tokens cover something like seventy-five words. That ratio is only an average, and it shifts with the vocabulary, the language, and the kind of text involved.
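As a rough illustration of the two ideas above, the sketch below maps pieces to integer identifiers and applies the four-characters-per-token rule of thumb. The toy vocabulary, the function names, and the way "How are you" is split are all hypothetical; a real tokenizer has its own vocabulary and may split the string differently.

```python
# Hypothetical toy vocabulary. Real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"How": 0, " are": 1, " you": 2}


def to_ids(pieces):
    """Map already-split pieces to their integer identifiers."""
    return [TOY_VOCAB[piece] for piece in pieces]


def estimate_tokens_from_chars(text, chars_per_token=4.0):
    """Rough English-only estimate: about four characters per token."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def estimate_words_from_tokens(tokens, words_per_token=0.75):
    """Rough English-only estimate: about three-quarters of a word per token."""
    return round(tokens * words_per_token)


ids = to_ids(["How", " are", " you"])            # [0, 1, 2]
words = estimate_words_from_tokens(100)          # 75
```

These estimates are only useful for back-of-the-envelope planning in English; for anything billed or limit-sensitive, count with the model's actual tokenizer.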
Why models settled on subwords
Subword tokenization is the answer to a problem with two tempting but broken alternatives. The first is to give every word its own token. This reads naturally but collapses the moment the model meets a word it never saw in training, and there are always such words: misspellings, new product names, technical jargon, rare inflections. A pure word vocabulary has no way to represent them except as a generic "unknown" placeholder, which throws away information the model needs.
The opposite extreme is to treat every character as a token. This never encounters an unknown input, since any text is just a sequence of characters, but it makes sequences punishingly long and forces the model to rebuild every word from its letters each time. Longer sequences cost more to process and leave less room for actual content. Subword tokenization threads the needle between these failures. Common words stay whole for efficiency, rare words decompose into reusable fragments, falling back to single characters or bytes when necessary, so nothing is ever fully out of vocabulary, and the total vocabulary stays small enough to be practical. Almost every current model uses some form of this compromise.
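The three options can be compared side by side with a small sketch. The vocabularies here are hypothetical, and the subword function uses a simple greedy longest-match rule with a single-character fallback, which is an illustrative stand-in and not how any particular production tokenizer segments text.

```python
# Hypothetical toy vocabularies, for illustration only.
WORD_VOCAB = {"the", "model"}
SUBWORD_VOCAB = {"the", "model", "token", "ization"}


def word_level(word):
    """One token per word; anything unseen collapses to a placeholder."""
    return [word] if word in WORD_VOCAB else ["<unk>"]


def char_level(word):
    """One token per character; never unknown, but long."""
    return list(word)


def subword_level(word, vocab=SUBWORD_VOCAB):
    """Greedy longest-match segmentation with a single-character fallback."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched, so emit one character and move on.
            pieces.append(word[i])
            i += 1
    return pieces


word_level("tokenization")     # ['<unk>']
len(char_level("tokenization"))  # 12
subword_level("tokenization")  # ['token', 'ization']
```

The word-level version loses the input entirely, the character-level version needs twelve tokens, and the subword version keeps all the information in two.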
How a tokenizer is built
A tokenizer is not designed by hand; its vocabulary is learned from data, much like the model it serves. The most common recipe is byte pair encoding, an algorithm originally from data compression that was adapted for language. It starts with the smallest possible units and repeatedly looks for the most frequent adjacent pair, merging that pair into a new single token. Run thousands of times over a large corpus, this process discovers that common letter sequences deserve to be tokens of their own, building up from characters to subwords to whole frequent words. The base units, together with the tokens created by that ordered list of merges, make up the vocabulary.
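The merge loop described above can be sketched in a few lines. This is a minimal character-level version under simplifying assumptions: the corpus is a plain list of words, ties between equally frequent pairs are broken by first occurrence, and there is no pre-tokenization or byte handling as real implementations have. The function names are illustrative.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges; return the merge list and the segmented corpus."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the winning merge everywhere before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

After two merges on this tiny corpus the frequent word "low" has become a single symbol, while "lower" and "lowest" are "low" plus leftover characters, which is the characters-to-subwords-to-words progression in miniature.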
Several variants of this idea power the models in wide use. Google's WordPiece merges by a likelihood criterion rather than raw frequency and underlies the BERT family. SentencePiece treats the input as a raw stream including spaces, which lets it work uniformly across languages that do not separate words with whitespace. Modern systems often operate on raw bytes rather than characters, an approach popularized by GPT-2, so that any possible input, in any language or symbol set, can always be encoded without an unknown token. The important consequence is that each model family ships its own tokenizer, and a given string can produce different token counts depending on which one you use.
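The reason a byte-level base never needs an unknown token is easy to see in code: every string has a UTF-8 encoding, and every byte is one of only 256 values. The sketch below shows just that starting alphabet, before any merges are learned; the function names are illustrative.

```python
def to_byte_ids(text):
    """Encode text as a list of integers in the range 0..255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids exactly."""
    return bytes(ids).decode("utf-8")


sample = "naïve 東京 🙂"
ids = to_byte_ids(sample)
restored = from_byte_ids(ids)  # identical to sample
```

A byte-level tokenizer learns its merges on top of these 256 base symbols, so even text it has never seen can always fall back to raw bytes.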
Tokens, cost, and the bill you actually pay
Tokenization stops being an abstraction the moment you look at pricing. Nearly every hosted model charges per token, and it charges separately for the tokens you send and the tokens it generates, usually billing output at a higher rate than input. Your monthly cost is therefore a direct function of how your text tokenizes, not how long it looks on screen. A prompt padded with rare words, unusual formatting, or repeated boilerplate carries more tokens, and more tokens mean more money, on every single request.
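A short calculation makes the billing model concrete. The prices below are placeholders chosen for easy arithmetic, not any provider's actual rates, and the function names are illustrative.

```python
# Hypothetical prices in dollars per million tokens. Check your provider's pricing page.
HYPOTHETICAL_INPUT_PRICE = 1.00
HYPOTHETICAL_OUTPUT_PRICE = 4.00


def request_cost(input_tokens, output_tokens,
                 input_price=HYPOTHETICAL_INPUT_PRICE,
                 output_price=HYPOTHETICAL_OUTPUT_PRICE):
    """Cost of one request, with input and output tokens billed at separate rates."""
    input_cost = input_tokens * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost + output_cost


def monthly_cost(requests_per_month, input_tokens, output_tokens):
    """Cost of a month of identical requests."""
    return requests_per_month * request_cost(input_tokens, output_tokens)


before = monthly_cost(1_000_000, input_tokens=1200, output_tokens=300)
after = monthly_cost(1_000_000, input_tokens=1000, output_tokens=300)
savings = before - after  # effect of trimming 200 input tokens from every request
```

At these placeholder rates, removing 200 tokens from a prompt that runs a million times a month saves about 200 dollars, without changing anything the model produces.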
This is where token awareness turns into real savings. The same information can often be expressed in fewer tokens through tighter prompts, shorter system messages, and trimmed context, and those reductions compound across high request volumes. Reasoning about tokens is the foundation under most LLM cost optimization work, and it sits close to the larger lesson that the hidden costs of LLM APIs usually live in places the pricing page does not advertise. Counting tokens before you ship a prompt is the cheapest optimization available, because it changes the cost of every call that follows.
Tokens and the context window
The second place tokens govern behavior is the limit on how much a model can consider at once. A model's context window is measured in tokens, not words or pages, and it bounds the combined size of your input and the model's output. When a request approaches that limit, it is the token count that matters, which is why a document that looks comfortably short can still overflow if it tokenizes into more pieces than expected.
This connects cost and capacity into one variable. Every token of input both consumes budget and occupies space in the window, so verbose prompts are doubly expensive: they cost more and they crowd out room the model could use for retrieved context or a longer answer. Teams building retrieval pipelines feel this directly, since the passages fetched by embedding similarity have to fit alongside the question and the response inside the same token budget. Managing the window is, in the end, managing tokens.
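The shared budget can be sketched as simple bookkeeping. The example assumes you already have token counts for the prompt and for each retrieved passage, that passages arrive sorted from most to least relevant, and that you reserve a fixed number of tokens for the answer; the function names and numbers are hypothetical.

```python
def remaining_for_context(window, prompt_tokens, reserved_output_tokens):
    """Tokens left for retrieved passages after the prompt and the reserved answer space."""
    return max(0, window - prompt_tokens - reserved_output_tokens)


def fit_passages(passage_token_counts, budget):
    """Return indices of the leading passages that fit inside `budget`.

    Stops at the first passage that does not fit, so the relevance order is preserved.
    """
    kept = []
    used = 0
    for index, count in enumerate(passage_token_counts):
        if used + count > budget:
            break
        kept.append(index)
        used += count
    return kept


budget = remaining_for_context(window=8000, prompt_tokens=1500, reserved_output_tokens=1000)
kept = fit_passages([2000, 2500, 1500, 300], budget)  # budget is 5500, kept is [0, 1]
```

Shrinking the prompt by a few hundred tokens in this example would let a third passage through, which is the sense in which verbose prompts crowd out useful context.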
Where tokenization quietly bites
Because tokenization is invisible, its consequences tend to surface as puzzling model behavior rather than obvious errors. The most economically significant is the language gap. Most popular tokenizers were trained on text that is overwhelmingly English, so English maps to relatively few tokens while many other languages fragment into far more. The same sentence in Hindi, Thai, or Burmese can require several times the tokens of its English equivalent, which means non-English users pay more and exhaust the context window faster for identical meaning. Larger multilingual vocabularies have narrowed this gap but rarely close it.
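One rough way to see where the gap starts, for tokenizers built on raw bytes, is to compare how many UTF-8 bytes each character needs in different scripts. This is only a proxy: the byte count is the number of tokens a byte-level tokenizer would need before any learned merges apply, and real counts depend on how many merges the training data provided for each language. The sample words are arbitrary greetings.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character (code point) in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "hello",
    "Hindi": "नमस्ते",
    "Thai": "สวัสดี",
}

ratios = {language: utf8_bytes_per_char(word) for language, word in samples.items()}
# English characters take one byte each; Devanagari and Thai characters take three.
```

A script that starts at three bytes per character needs far more learned merges to reach the same tokens-per-word ratio as English, and a vocabulary trained mostly on English text spends few of its entries on them.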
The other consequences are about reasoning, not cost. Since the model works in tokens, tasks that hinge on individual characters become deceptively hard. Counting the letters in a word, reversing a string, or performing exact arithmetic can fail because the relevant characters are bundled inside larger tokens the model cannot easily pull apart, a class of failure documented in studies of character-level understanding. How a tokenizer splits numbers influences arithmetic reliability, and how it segments code affects programming accuracy. None of these reflect a flaw in the model's reasoning so much as the limits of the unit it is forced to think in.
What this means in practice
Tokenization is the layer where text becomes something a model can process, and almost every practical property of working with language models traces back to it. Price is denominated in tokens. The context limit is denominated in tokens. The strange edges of model behavior, from the multilingual cost penalty to the trouble with counting letters, follow from the choice to reason over subwords rather than words or characters. Understanding tokens is what turns these from surprises into things you can anticipate and plan around.
The useful habit is to make token counts visible before they cost you. Most providers publish a tokenizer or a counting tool, such as OpenAI's tiktoken, and running representative prompts through one shows exactly how your text decomposes and where the tokens accumulate. From there the optimizations are concrete: trim prompts, watch the window, and weigh how a given model tokenizes your particular workload. The Inferbase model catalog tracks the open model families you can run behind a single endpoint, and the unified inference API lets you serve them through one integration while you measure how each one turns your text into tokens.
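The counting habit described above can be sketched with tiktoken. This example assumes the package is installed and that the "cl100k_base" encoding is appropriate for the model you are measuring, which may not be the case; other model families need their own tokenizers. If tiktoken cannot be loaded, the hypothetical helper below falls back to the four-characters-per-token estimate and says so in its return value.

```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Return (token_count, method), where method is 'tiktoken' or 'estimate'."""
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), "tiktoken"
    except Exception:
        # tiktoken is missing or its encoding files could not be loaded.
        # Fall back to the rough English rule of about four characters per token.
        if not text:
            return 0, "estimate"
        return max(1, round(len(text) / 4)), "estimate"


prompts = [
    "How are you",
    "Summarize the following support ticket in two sentences.",
]
for prompt in prompts:
    count, method = count_tokens(prompt)
    print(f"{count:4d} tokens ({method}): {prompt}")
```

Running a representative sample of real prompts through a loop like this, with the tokenizer that matches your model, is usually enough to show which parts of a prompt carry most of the tokens.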