Most of the attention in artificial intelligence goes to models that produce something a person can read: a paragraph of text, an answer to a question, a block of code. Underneath those visible outputs sits a quieter layer that rarely makes a headline, and yet a large share of production systems would not function without it. Before a model can compare two documents, search a knowledge base, or decide that two support tickets are really the same complaint, the language has to be turned into numbers. That translation is the job of embeddings.
An embedding is a fixed-length list of numbers, a vector, that represents a piece of text as a single point in a high-dimensional space. The embedding model that produces it is trained so that the position of the point reflects meaning: texts that mean similar things are placed close together, and texts that mean different things land far apart. Once language lives in this space, the messy problem of comparing meaning becomes the tractable problem of measuring distance. This post builds on the guide to AI inference and explains the representation that quietly powers semantic search, retrieval, and much of what makes retrieval-augmented generation work.
What an embedding actually is
At its simplest, a text embedding is a vector of a few hundred to a few thousand numbers, where each number is a coordinate along one dimension of a learned space. A short query and a long document both map to a vector of the same length, so a 768-dimensional model turns every input that fits within its context length, whether a few words or several paragraphs, into exactly 768 numbers. Common embedding sizes range from 384 dimensions for lightweight models to 1536 or 3072 for larger ones, and the choice trades representational capacity against storage and compute.
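To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes each coordinate is stored as a 32-bit float (4 bytes), which is a common default but not something every system uses, and it ignores index overhead and metadata. The function name `index_size_bytes` is made up for illustration.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of a vector collection, ignoring index overhead and metadata.

    bytes_per_value=4 assumes float32 storage (an assumption, not a rule).
    """
    return num_vectors * dimensions * bytes_per_value


# One million documents at three of the embedding sizes mentioned above.
for dims in (384, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>4} dims: {gib:.2f} GiB")
```

Under these assumptions, a million vectors take roughly 1.4 GiB at 384 dimensions and about 11.4 GiB at 3072, an eightfold difference before any index structure is added.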
The numbers themselves are not interpretable one at a time. No single dimension corresponds neatly to a concept like "formality" or "topic," and reading the raw vector tells you nothing. What carries meaning is the position of the whole vector relative to others. The model has learned, across a large amount of text, to arrange inputs so that geometric closeness in the space corresponds to closeness in meaning. The vector is useful only in comparison, never on its own.
Why similar meaning lands in nearby vectors
The defining property of a good embedding space is that distance tracks meaning. Two sentences that say nearly the same thing in different words will produce vectors that sit close together, while two sentences on unrelated topics will produce vectors far apart. This is what lets a search system match a question to a relevant passage even when they share almost no vocabulary, because the match is computed on meaning rather than on overlapping words.
The most common way to measure that closeness is cosine similarity, which compares the angle between two vectors rather than the straight-line distance between their endpoints. The score ranges from -1 to 1: a value of 1 means the vectors point in the same direction and the texts are judged maximally related, and a value near zero means they are essentially unrelated. Comparing orientation rather than magnitude matters in practice, because it keeps a short query and a long document comparable when they are about the same subject. The geometry that makes this work is learned from data, not designed by hand, which is both the source of its power and the reason it inherits the patterns and blind spots of the text it was trained on.
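A minimal sketch of cosine similarity in plain Python may help here. The three-dimensional vectors are toy values invented for the example, not output from a real embedding model, and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (made up for illustration).
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # twice the magnitude, same orientation
orthogonal = [0.0, 0.0, 3.0]      # shares no direction with the query

print(cosine_similarity(query, same_direction))  # close to 1
print(cosine_similarity(query, orthogonal))      # 0
```

The second vector is twice as long as the first, yet the score is still approximately 1, which is the sense in which cosine similarity ignores magnitude and compares only orientation.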
How embeddings are produced
Embeddings are generated by a dedicated embedding model, which is a separate component from the generative model that writes answers. The model reads the input text, runs it through its layers, and emits a single vector that summarizes the whole input. Most modern embedding models are transformers, the same architecture family behind large language models, but they are trained toward a different goal: producing representations that place similar inputs close together, rather than predicting the next token.
The lineage of this idea is worth a brief look, because it explains why current models behave the way they do. Early approaches such as Word2Vec and GloVe learned one fixed vector per word, which captured rough semantic relationships but could not tell apart the two senses of a word like "bank" because the vector was the same in every context. The shift to contextual models, marked by BERT, produced a different vector for a word depending on the sentence around it. To turn those per-token vectors into a single vector for a whole sentence or document, methods like Sentence-BERT pool the token representations and fine-tune the model specifically for similarity, which is the recipe most production embedding models still follow.
In practical terms, producing an embedding is a single forward pass. The text is tokenized, run through the model, pooled into one vector, and usually normalized to unit length so that cosine similarity behaves consistently. Because it is one pass with no generation step, embedding a piece of text is far cheaper than asking a generative model to reason about it, which is exactly what makes embedding large collections affordable.
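The last two steps, pooling and normalization, can be sketched in a few lines. This example assumes mean pooling, which is one common choice; real models may use a different strategy (such as taking a special token's vector) and typically exclude padding tokens. The per-token vectors below are invented stand-ins for what a transformer would emit.

```python
import math


def mean_pool(token_vectors):
    """Average the per-token vectors into one vector (mean pooling, an assumption)."""
    count = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dims)]


def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


# Hypothetical output of a model for a three-token input, three dimensions each.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
]

pooled = mean_pool(token_vectors)   # [2.0, 2.0, 1.0]
embedding = l2_normalize(pooled)    # length 3 before scaling, length 1 after
print(pooled, embedding)
```

Once every vector has unit length, the cosine similarity of two embeddings reduces to their dot product, which is why normalizing at encoding time makes later comparisons cheaper and more consistent.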
What embeddings are used for
The reason embeddings appear in so many systems is that a surprising number of tasks reduce to "find things with similar meaning" or "group things by meaning." Once text is a point in space, those tasks become geometry.
- Semantic search. Embed the query and the documents, then return the documents whose vectors are closest to the query. Unlike keyword search, this finds relevant results that use none of the query's exact words.
- Retrieval-augmented generation. Retrieve the most relevant passages by embedding similarity and feed them to a generative model as context. Embeddings are the retrieval half of RAG, and the quality of the answer depends heavily on the quality of the retrieval.
- Clustering and deduplication. Group documents whose vectors fall near one another to discover topics, or flag near-duplicate content that shares meaning without sharing wording.
- Classification and recommendation. Compare an item's embedding against labeled examples or against a user's history to assign a category or surface related items.
The common thread is that the embedding does the work of understanding once, up front, and every downstream comparison is then a fast distance calculation. Searching millions of documents at query time is feasible precisely because the expensive part, encoding meaning into a vector, has already happened.
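The semantic search pattern from the list above can be sketched as a precomputed table of document vectors plus a scoring loop at query time. Everything here is illustrative: the document ids and vectors are made up, the vectors are assumed to be unit length already so that a dot product equals cosine similarity, and the linear scan is only reasonable for small collections.

```python
import heapq


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, doc_vectors, k=2):
    """Return the k (score, doc_id) pairs with the highest similarity to the query."""
    scored = ((dot(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items())
    return heapq.nlargest(k, scored)


# Computed once, ahead of time (hypothetical unit-length vectors).
doc_vectors = {
    "refund-policy": [0.6, 0.8, 0.0],
    "shipping-times": [0.0, 0.6, 0.8],
    "password-reset": [0.8, 0.0, 0.6],
}

# Computed per request by embedding the user's query with the same model.
query_vector = [0.8, 0.6, 0.0]

results = top_k(query_vector, doc_vectors, k=2)
for score, doc_id in results:
    print(f"{doc_id}: {score:.2f}")
```

The only model call at query time is the one that embeds the query itself; the comparison against stored documents is plain arithmetic.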
How embeddings differ from the generative model
It is easy to assume the embedding comes from the same model that answers questions, but the two are usually distinct and serve opposite purposes. A generative large language model is trained to continue text by predicting the next token, and its output is a sequence of words. An embedding model is trained to compress an input into one representative vector, and its output is a single point with no words attached. One produces language, the other produces a coordinate.
This difference shows up in cost and size. Embedding models are typically much smaller and faster than the generative models they support, which is what allows a system to embed an entire document store in advance and keep the heavyweight generative model in reserve for the final answer. The division of labor is deliberate: the cheap model handles the broad task of finding what is relevant, and the expensive model handles the narrow task of producing a response from it. Understanding where each model sits is part of reasoning about the cost of an inference pipeline, the same way choosing the right model per request is.
Choosing and using an embedding model
The practical decisions when adopting embeddings are fewer than for generative models, but they matter for both quality and cost. The table below covers the main ones.
| Consideration | What it affects | Practical guidance |
|---|---|---|
| Dimensions | Representational capacity, storage, query speed | More dimensions hold more information but cost more to store and compare. 384 to 1024 suits most uses; reach higher only if evaluation shows it helps. |
| Context length | How much text fits in one embedding | A longer context lets you embed bigger chunks without splitting. Match it to your typical document or passage size. |
| Domain and language | Retrieval accuracy on your data | A model trained on general web text may underperform on legal, medical, or code corpora. Check multilingual coverage if you need it. |
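| Normalization | Consistency of similarity scores | Normalizing vectors to unit length makes cosine similarity equal to a plain dot product and keeps scores consistent. Many models already return unit-length vectors; check the model's documentation and normalize yourself if it does not. |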
| Benchmark fit | A starting shortlist, not a verdict | Public rankings such as the Massive Text Embedding Benchmark give a general signal; validate finalists on your own queries. |
Two further details often surprise teams adopting embeddings for the first time. The first is that the vectors have to live somewhere built for similarity search, since scanning every vector for each query does not scale; vector databases use approximate nearest neighbor indexes such as HNSW to return close matches in sublinear time. The second is that some recent models support Matryoshka representations, where a single high-dimensional vector can be truncated to a shorter one with graceful quality loss, letting you trade accuracy for storage without re-embedding everything.
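As a rough sketch of the truncation idea: keep the leading coordinates and rescale the result back to unit length. This only makes sense for a model that was trained to support Matryoshka-style truncation; applying it to an arbitrary model's vectors is likely to hurt quality badly. Renormalizing after truncation is an assumption here, made so that dot products remain cosine similarities, and the function name is hypothetical.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` coordinates and rescale to unit length.

    Assumes the source model was trained for Matryoshka-style truncation.
    """
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector's length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


# Toy four-dimensional unit vector (made up), shortened to two dimensions.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
print(len(short), short)
```

Because the shorter vectors are derived from the stored ones, the storage saving comes without another pass through the embedding model, though the effect on retrieval quality should still be measured on your own data.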
What embeddings do not do
Embeddings are a representation of meaning, not a substitute for reasoning, and treating them as more than they are leads to predictable failures. The vector captures how language is used, which correlates with meaning but does not encode truth. Two false statements about the same subject can sit close together in the space, and a correct answer phrased in unusual terms can sit surprisingly far from the question it answers. Similarity is a strong signal of relatedness, not a verdict on relevance or accuracy.
The representation is also lossy and fixed at the moment of encoding. Compressing a long document into a few hundred numbers necessarily discards detail, so fine-grained distinctions can wash out, and a vector produced today reflects the model that produced it. Switching embedding models means re-embedding the entire corpus, because vectors from different models live in different, incompatible spaces and cannot be compared. None of these are reasons to avoid embeddings; they are reasons to pair retrieval with verification rather than trusting proximity alone.
What this means in practice
Embeddings are the layer that lets a system reason about meaning at scale without running an expensive generative model on everything. Whenever a product needs to search by intent, retrieve relevant context, group similar items, or detect duplicates, the foundation is almost always an embedding model turning text into vectors and a similarity measure comparing them. The generative model gets the attention, but the embedding model is what makes finding the right input fast enough to be practical.
For teams building retrieval pipelines, the useful next step is to treat the embedding model as a component to evaluate rather than a default to accept, the same discipline that applies to picking a generative model. Compare a few candidates on a representative sample of your own queries, measure retrieval quality directly, and only then commit. The Inferbase model catalog tracks the open model families you can run behind a single endpoint, and the unified inference API lets you serve both the retrieval and generation halves of a pipeline through one integration.
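One simple way to measure retrieval quality when comparing candidates is recall at k: of the passages a human marked as relevant for a query, what fraction appears in the top k results. The sketch below assumes you have such labels; the document ids and judgments are invented, and recall at k is only one of several reasonable metrics.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall at k over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical results from one candidate model on two labeled queries.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # both relevant docs retrieved
    (["d9", "d2", "d4"], {"d4", "d5"}),  # one of two relevant docs retrieved
]

print(mean_recall_at_k(runs, k=3))  # 0.75
```

Running the same labeled queries through each candidate model and comparing the resulting scores gives a like-for-like basis for the choice.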