In the early years of large language models, the assumption was that a model big enough would simply know everything it needed to know. The parameters would absorb the world, and the answer to any question would already be in the weights somewhere. That assumption broke down quickly, and in surprisingly stubborn ways. Models hallucinate confidently when they do not know an answer, lose track of facts that change after their training cutoff, and have no awareness of the documents inside any specific organization. By 2020 the limits were already visible enough that a Facebook AI Research team published a paper proposing a different architecture, one in which a retrieval step ran alongside the generation step and let the model condition its output on a corpus it had never been trained on. That architecture is now called retrieval-augmented generation, and it has since become the dominant pattern for connecting large language models to specific bodies of knowledge.
Retrieval-augmented generation (RAG) is the architecture in which an LLM answers a question by first retrieving the most relevant pieces of text from an external corpus, and then conditioning its generation on that retrieved context. The corpus can be a company's internal documentation, a knowledge base, a product catalog, a body of legal filings, a customer-support archive, or any other set of documents the model itself has not been trained on. RAG has become the default architecture for any production LLM application that needs to ground its output in specific source material, and the surrounding infrastructure (embedding models, vector databases, reranking models, chunking pipelines) has matured into a stack of its own. This post builds on the foundational guide to AI inference, the tool-calling primer, and the explanation of small language models, and looks at the retrieval architecture that increasingly sits in front of production LLM deployments.
What RAG actually is
A retrieval-augmented generation system is, at its core, two coordinated components: a retriever and a generator. The retriever takes the user's query and returns the most relevant passages from an external corpus. The generator (typically a large language model) takes the user's query together with the retrieved passages and produces an answer that is conditioned on both. The model itself is not retrained when the corpus changes; the knowledge lives outside the parameters and is consulted at query time.
This separation is what gives RAG its leverage. The corpus can grow, change, or be replaced without touching the model. The model itself can be swapped without rebuilding the index. New documents become available to the system the moment they are indexed. The same general-purpose model can serve dozens of different RAG applications, each with its own private corpus, simply by swapping the retriever's data source. RAG is, in this sense, an architectural pattern more than a model technique. The pattern pairs a parametric model with a non-parametric memory, without retraining or modifying the model itself.
The mechanism was first formalized in Lewis et al. (2020), where the authors trained a model to attend over a separately indexed knowledge corpus while generating each token. Production systems today rarely use the joint-training approach from the original paper, and instead use a much simpler pipeline of "retrieve, then prompt": the retrieval step runs as a standalone search, and the retrieved passages are inserted into the prompt sent to an off-the-shelf generative model. The simplification has held up well in practice and is the design that the rest of this post describes.
Why RAG exists
The case for RAG comes down to three structural limits of pure parametric models, each of which ends up being a fundamental problem rather than a temporary engineering gap.
The first is the training cutoff. A model knows what was in its training data, and its training data has a date stamp. By the time the model is deployed, weeks or months have already passed; by the time the model has been in production for a year, parts of its knowledge are visibly stale. Retraining is expensive, so the cutoff effectively persists for the lifetime of the deployment. RAG sidesteps the cutoff entirely, because the corpus can be re-indexed continuously and the freshest version of every document is what gets retrieved.
The second is hallucination on facts. When a parametric model is asked a factual question and is uncertain, it will often produce a plausible-sounding but wrong answer rather than admit uncertainty. The model has no internal mechanism to distinguish what it actually knows from what it is interpolating between similar training examples. RAG narrows this failure mode by giving the model a concrete passage to cite from. The model still hallucinates occasionally, but the prompt now contains a verifiable source, and the answer can be checked or rejected against that source rather than against the model's opaque memory.
The third is the proprietary-corpus problem. The model has not been trained on the customer's invoices, the company's internal documentation, last quarter's product roadmap, or the specific contract clauses that an enterprise needs to query against. No amount of parameter scaling addresses this, because the data was never in the training set in the first place. RAG addresses it directly: the retriever can index any text the application has access to, and the model can answer from that indexed corpus the same way it would answer from training data.
These three problems compound one another, and together they account for the motivation behind most production RAG deployments. Even when one or two of them seem manageable in isolation, the combination tends to push teams toward a retrieval architecture as the cleanest solution.
The core RAG pipeline
A production RAG system runs two pipelines that operate on different timescales. The indexing pipeline runs offline (or incrementally as documents change) and turns the corpus into a searchable index. The query pipeline runs online for every user request and uses the index to retrieve relevant passages, augment the prompt, and generate a response. The two pipelines share an embedding model and a vector store, and otherwise operate independently.
The indexing pipeline starts with the source corpus, which might be PDFs, wiki pages, support tickets, code repositories, or any other body of text the application has access to. Each document is split into chunks (passages of a few hundred tokens each), every chunk is run through an embedding model that produces a fixed-length vector, and the vectors are stored in a vector database alongside their source text and metadata. When the corpus changes (a document is added, edited, or deleted), the relevant chunks are re-embedded and the index is updated. The indexing pipeline is where most of the operational complexity of a RAG system actually lives.
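A minimal sketch of the indexing pipeline in Python. The `embed()` function here is a toy stand-in for a real embedding model (it hashes words into a unit-length vector so the example runs end to end), and the in-memory `index` list stands in for the vector database; both are assumptions made purely to keep the sketch self-contained.

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for a real embedding model (e.g. text-embedding-3 or BGE).
    Hashes words into a fixed-length unit vector so the example is runnable."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-window chunking by whitespace tokens, with overlap between chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks

# The "index": each entry keeps the vector, the source text, and metadata,
# mirroring what a row in a vector database would hold.
index: list[dict] = []

def index_document(doc_id: str, text: str) -> None:
    for i, passage in enumerate(chunk(text)):
        index.append({
            "id": f"{doc_id}#{i}",
            "vector": embed(passage),
            "text": passage,
            "metadata": {"source": doc_id, "chunk": i},
        })
```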
The query pipeline runs a much shorter sequence. The user's query is embedded with the same embedding model, and the query vector is used to search the vector index for the most similar chunks, returning the top-k matches (commonly k = 5 to 20). The retrieved chunks are optionally passed through a reranker that re-scores them with a more expensive but more accurate model, narrowing top-k down to a smaller top-n.
The narrowed set is then inserted into the prompt sent to the generative model alongside the original query, and the generative model produces a response that is conditioned on both. The query pipeline typically adds 50 to 200 milliseconds of overhead per request, depending on the index size, the embedding latency, and whether reranking is enabled.
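The query side, continuing the sketch above: the query is embedded with the same toy `embed()` function and scored against the index by cosine similarity. Reranking, prompt assembly, and the generation call are left to the component sketches later in this post.

```python
def cosine(a: list[float], b: list[float]) -> float:
    # Vectors from embed() are unit-normalised, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, k: int = 5) -> list[dict]:
    """Embed the query with the same model used at indexing time and
    return the k most similar chunks from the index."""
    qvec = embed(query)
    return sorted(index, key=lambda c: cosine(qvec, c["vector"]), reverse=True)[:k]

# The retrieved chunks are then optionally reranked, formatted into the prompt
# (see the augmentation sketch below), and sent to the generation model.
```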
The components in detail
Each stage of the RAG pipeline has a small set of well-understood design choices, and the way those choices combine ends up determining the overall quality of the system more than any single component does.
Embedding model. The embedding model converts text into a fixed-length vector, which is the substrate the retriever operates on. The dominant production choices are OpenAI's text-embedding-3 series, Cohere's embed-v3, Voyage AI, and the open-weight BGE family. The choice matters because the embedding determines what "similar" means: a generic model will retrieve generically similar text, while a model fine-tuned on a particular domain will retrieve passages that are relevant to that domain's notion of similarity.
Chunking. The corpus has to be split into retrievable units, and the granularity of that split has a large effect on retrieval quality. Chunks that are too small lose context (a single sentence rarely contains enough information to answer a question on its own), while chunks that are too large dilute the relevance signal (an entire document chunk may match a query for the wrong reason). The working consensus is somewhere between 200 and 800 tokens per chunk with overlap between adjacent chunks, with the optimal point depending on the document type and the query distribution. Section-aware and recursive chunking strategies (for code, markdown, or structured documents) usually outperform fixed-window splits.
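A hedged sketch of a section-aware chunker for markdown-style documents: split on headings first, then fall back to a sliding window with overlap inside any section that is still too large. Word counts approximate token counts here; a production implementation would count real tokens.

```python
import re

def chunk_markdown(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split on markdown headings first, keeping each heading with its section,
    then apply a sliding window inside oversized sections."""
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        if len(words) <= max_words:
            chunks.append(section.strip())
            continue
        start = 0
        while start < len(words):
            chunks.append(" ".join(words[start:start + max_words]))
            start += max_words - overlap
    return chunks
```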
Vector store. The vector database stores the chunk embeddings and supports fast similarity search. The dominant production choices include Pinecone, Weaviate, Qdrant, Chroma, and the pgvector extension to PostgreSQL. The functional differences between them are smaller than the marketing suggests at small to medium scale; the operational differences (managed versus self-hosted, ease of integration, query language, metadata filtering) often dominate the choice.
Retriever. Vector similarity search is the default retrieval mechanism, but pure vector search has known weaknesses. Exact keyword matches (product codes, specific names, rare technical terms) are often handled better by classical BM25 lexical search, and most production systems run a hybrid retriever that combines vector and BM25 scores. Hybrid retrieval consistently outperforms either approach alone on the kind of queries that mix semantic intent with specific terms.
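A common way to merge the lexical and vector result lists is reciprocal rank fusion, sketched below; it assumes the two retrievers have already produced ranked lists of chunk ids.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one.
    Each id scores 1 / (k + rank) in every list it appears in; k = 60 is the
    conventional constant from the original RRF formulation."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse the BM25 and vector rankings produced upstream, keep the top 20.
# fused_ids = reciprocal_rank_fusion([bm25_ids, vector_ids])[:20]
```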
Reranker. The retriever returns the top-k chunks based on similarity, but similarity is a coarse signal. A reranker is a smaller model (usually a cross-encoder) that takes the query and each candidate chunk together and produces a more accurate relevance score. Reranking adds latency and cost but consistently improves retrieval quality, particularly when the corpus is large or the queries are subtle. Cohere Rerank, Voyage's reranker, and the open-weight BGE reranker are the common choices.
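A sketch of cross-encoder reranking using the open-weight BGE reranker through the sentence-transformers library; the exact model id and the top-n cutoff are illustrative choices rather than recommendations.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly, which is slower
# than bi-encoder similarity but considerably more accurate.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    pairs = [(query, c["text"]) for c in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```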
Augmentation. The retrieved chunks have to be inserted into the prompt sent to the generative model, and the way they are inserted affects how the model uses them. The naive approach is to concatenate the chunks at the top of the prompt, but production systems typically do more: they label each chunk with a source identifier, format them in a consistent template, and include explicit instructions on how the model should handle missing or contradictory context. The augmentation step is a frequent source of subtle quality issues, because a model that does not get its retrieved context in the format it expects will quietly underperform.
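A sketch of the augmentation step: each chunk is labeled with a source identifier, rendered through a fixed template, and the instructions spell out how the model should behave when the context is missing or contradictory. The template wording is illustrative rather than canonical.

```python
PROMPT_TEMPLATE = """\
You are answering a question using the context passages below.
Rules:
- Use only the provided context; do not rely on prior knowledge.
- Cite passages by their [source:...] identifier.
- If the context does not contain the answer, say "Not found in the provided documents."
- If passages contradict each other, point out the contradiction.

{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[source:{c['metadata']['source']}#{c['metadata']['chunk']}]\n{c['text']}"
        for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```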
Generator. Any general-purpose LLM can be used as the generator in a RAG system, and the choice depends on the cost-quality trade-off the application can absorb. Frontier models (GPT-4, Claude Opus, Gemini Pro) are the default for hard or high-stakes RAG applications. Smaller open-weight models are increasingly viable for routine RAG, particularly when fine-tuned on the application's domain or when reranking has already narrowed the context to a small, high-quality set.
RAG variants
The basic "retrieve then generate" pipeline has produced a family of variants over the past few years, each addressing a specific weakness of the naive design.
Naive RAG is the single-shot baseline described above: embed the query, retrieve top-k, augment the prompt, generate. It works well on straightforward question answering against a coherent corpus and is the right starting point for most applications.
Advanced RAG adds a set of pre- and post-retrieval refinements. Query rewriting expands or rephrases the user's question before retrieval (a model rewrites a casual "what changed in the refund policy" into a more retrieval-friendly "refund policy changes effective dates terms"), and query decomposition splits a multi-part question into sub-questions that are retrieved independently. Hybrid retrieval combines vector and lexical search, and reranking refines the candidate set after retrieval. Each of these helps on a specific class of queries that naive RAG handles poorly, and most production systems eventually add several of them.
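As an illustration of the pre-retrieval side, the sketch below shows a query-rewriting step; the `generate` argument stands in for whatever chat-completion call the application already makes, and the prompt wording is one of many reasonable phrasings.

```python
REWRITE_PROMPT = """\
Rewrite the user question as a short, keyword-dense search query that will
retrieve the relevant passages from a document index.
Return only the rewritten query.

User question: {question}
Search query:"""

def rewrite_query(question: str, generate) -> str:
    # `generate` is whatever function the application already uses to call its LLM.
    return generate(REWRITE_PROMPT.format(question=question)).strip()

# rewrite_query("what changed in the refund policy", generate)
# -> e.g. "refund policy changes effective dates terms"
```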
Agentic RAG moves the retrieval decision into the model itself. Rather than retrieving on every query, the model decides whether retrieval is needed, and if so, what to retrieve and how many times. The model can issue multiple retrievals in sequence, refining the query each time based on what it has already seen. This pattern shows up in agent frameworks and is increasingly common in production systems that handle complex multi-step questions.
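One way agentic RAG is wired up in practice is to expose retrieval as a tool the model can call zero or more times per query. The sketch below uses a generic JSON-schema style tool definition; the field names follow a common convention, but the exact schema depends on the tool-calling API in use.

```python
# Retrieval exposed as a tool the model can invoke, with refined queries, as
# many times as it decides it needs to.
SEARCH_TOOL = {
    "name": "search_corpus",
    "description": (
        "Search the indexed document corpus. Call this when the answer requires "
        "information from the documents. You may call it multiple times with "
        "refined queries."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "top_k": {"type": "integer", "description": "Number of chunks to return"},
        },
        "required": ["query"],
    },
}

def handle_tool_call(name: str, arguments: dict) -> str:
    """Executed by the application when the model decides to retrieve;
    reuses the retrieve() function from the query-pipeline sketch."""
    if name == "search_corpus":
        chunks = retrieve(arguments["query"], k=arguments.get("top_k", 5))
        return "\n\n".join(c["text"] for c in chunks)
    raise ValueError(f"unknown tool: {name}")
```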
GraphRAG uses a knowledge graph rather than (or in addition to) vector similarity. Documents are processed once to extract entities and relationships, and the graph structure is used to retrieve clusters of related information rather than independent passages. GraphRAG handles multi-hop questions ("which of our customers also use product X and have an open support ticket") more reliably than naive vector retrieval, at the cost of significantly higher indexing complexity.
Self-RAG trains the generator to evaluate its own retrieval. The model emits special tokens during generation that indicate whether retrieval was needed for the current span and whether the retrieved context was actually used. This produces a generation pass with built-in self-assessment, which can be used to gate weak retrievals or to decide when the model should refuse to answer.
Adaptive RAG routes queries to different retrieval strategies based on the query type. A simple fact lookup gets a fast vector search, a multi-hop question gets a graph or agentic pipeline, an out-of-corpus question gets refused. The router itself is usually a small classification model.
The choice between variants is usually a layered decision. Most production systems start with naive RAG, add advanced RAG techniques as quality plateaus, and adopt agentic or graph-based variants only when query complexity demands it.
What RAG does well
The strengths of RAG follow directly from the architecture. The system can answer from a corpus that the model has never been trained on, which means new documents become queryable as soon as they are indexed. Citations come built in, because the retrieved passages are concrete artifacts that the application can show to the user alongside the answer. Updates flow through the corpus rather than the model, which collapses what would otherwise be a months-long retrain cycle into a few minutes of incremental indexing.
RAG also scales well in a particular way. Adding a hundred thousand documents to a vector index is cheap and fast, while adding the same content to a model's training data requires a full retraining run. The marginal cost of corpus growth is dominated by storage and embedding compute, both of which scale linearly and predictably. This is what makes RAG the default architecture for any application where the corpus is large, frequently changing, or both.
The citation property turns out to be operationally important even when the user does not see the citations directly. Internal evaluation, debugging, and quality assurance all benefit from being able to point to the specific passages a generation was conditioned on, and audit trails for regulated industries (legal, healthcare, finance) often require source provenance that pure parametric models cannot easily produce.
Where RAG breaks
The failure modes of RAG are well-mapped at this point, and most of them concentrate around the retrieval step rather than the generation step.
Bad chunking destroys context. A passage split in the middle of a key sentence, a table fragmented across two chunks, or a section header separated from the section it labels can render the retrieved context useless. Chunking strategy is one of the largest single quality levers in a RAG system, and it tends to be tuned over many iterations rather than gotten right on the first try.
Embedding-content mismatch. A general-purpose embedding model used against a specialized corpus often retrieves generically similar text rather than domain-relevant text. Medical, legal, and technical corpora frequently benefit from domain-specific embedding models or from fine-tuning a general embedding model on the application's data.
Multi-hop reasoning. Naive vector retrieval handles single-fact questions well and struggles when the answer requires combining information from multiple documents. "What was the average response time of our top three customers last quarter" requires identifying the top three customers, retrieving each one's response times, aggregating the result, and answering. None of those steps is well-served by similarity search, and graph-based or agentic variants are usually required.
Hallucination still happens. A model given an irrelevant passage will sometimes generate an answer that confidently cites the irrelevant passage as evidence. Reranking and self-evaluation reduce this failure mode, though they do not eliminate it. The system needs application-level safeguards (calibrated refusal, source-citation verification, downstream eval) to catch confident wrong answers.
Operational complexity. A production RAG system has more moving parts than a pure model deployment: an indexing pipeline, an embedding service, a vector store, a retrieval service, an optional reranking service, prompt templates, and the generation model itself. Each of these is a deployment surface that has to be monitored, versioned, and updated. The total operational burden of a mature RAG system is comparable to a small data pipeline rather than a single model endpoint.
Latency cost. Retrieval and reranking add measurable latency to every request, typically in the 50 to 200 millisecond range. For latency-sensitive applications (voice agents, in-loop assistants), this overhead is often the gating factor on the architecture, and various caching and pre-fetching tricks become important.
RAG vs fine-tuning vs long context
A common question about RAG is when it is the right choice compared to the alternatives, particularly fine-tuning a smaller model on the corpus or using a long-context model and putting the corpus directly in the prompt.
Fine-tuning changes the model's behavior on a known distribution and is the right choice for tasks where the desired output has a specific style, format, or pattern that the team can demonstrate with examples. Fine-tuning is poor at adding new factual knowledge: the model learns to mimic the form of the training examples, but it does not reliably memorize facts in a way that later queries can extract. Fine-tuning and RAG are also often complementary, with the fine-tune handling format and the retrieval handling the facts.
Long context has become more viable as model context windows have grown to hundreds of thousands of tokens. For small corpora that fit comfortably in a single prompt, long context is sometimes the simpler answer: the documents go directly into the prompt and the retrieval step is skipped entirely. The catch is that effective long-context retrieval is weaker than the advertised window length suggests, and the per-request cost grows linearly with the corpus size. RAG remains the better choice whenever the documents do not comfortably fit in the prompt budget on every request, and the breakeven point sits in the low tens of thousands of tokens for most applications.
The decision rarely reduces to a single axis. Production systems often use all three together: a fine-tuned smaller model handling format and tone, a RAG pipeline supplying current and proprietary facts, and a frontier model in long-context mode reserved for the small fraction of complex queries that justify the cost.
Summary table
| Dimension | RAG | Fine-tuning | Long context |
|---|---|---|---|
| Adds new factual knowledge | Yes | Poorly | Yes (per request) |
| Handles changing facts | Yes (re-index) | No (retrain) | Yes (swap docs in prompt) |
| Provides citations | Yes | No | Partial |
| Scales to large corpus | Yes | No | No |
| Adds inference latency | 50–200 ms | None | Grows with prompt length |
| Adds infrastructure | Vector DB, indexing | Training pipeline | None |
| Per-request cost | Low | Low | High (long prompts) |
| Best for | Facts, large or changing corpus | Format, style, narrow behavior | Small corpora, one-off |
What this means in practice
RAG has settled into the default architecture for any LLM application that needs to answer from a specific body of knowledge, and the production patterns around it have stabilized. Most teams now build RAG systems as part of their standard application stack rather than as research projects, and the open-source frameworks (LangChain, LlamaIndex, Haystack) provide most of the orchestration glue out of the box.
The interesting design choices have moved up the stack. The naive RAG pipeline is largely a solved problem; the engineering work that differentiates a good RAG system from a mediocre one is in chunking strategy, embedding-model fit, hybrid retrieval design, reranker selection, prompt templating, and the application-level evaluation harness that catches quality regressions. None of these are choices that a framework can make automatically, and most of them require domain-specific iteration to get right.
For teams looking to prototype RAG workloads against multiple model families on a single API, the Inferbase unified inference API exposes both frontier and open-weight generation models with consistent behavior across the embedding, generation, and tool-calling surfaces. The model catalog covers the model-selection question for the generation step specifically, and the GPU sizing guide for LLM inference covers the related hardware question of self-hosting the generation model behind a RAG pipeline.