Browse any inference provider's model list and you will find the same base names wearing a parade of suffixes: FP8, Turbo, AWQ, LoRA, Distill, Instruct. They are usually presented as a menu of interchangeable options, as if each were simply a cheaper or faster way to get the same thing. That presentation hides a distinction that matters a great deal in practice, because these labels describe operations that are not remotely equivalent to one another.
Some of them leave a model's weights exactly as they were and only change how the computation runs. Some compress the weights, trading a little accuracy for memory and speed. And some produce an entirely new model that merely borrows a famous name. The reason this is worth getting right is that a published benchmark is measured on one specific artifact, and whether that score still applies to the version you are about to call depends entirely on which of these things the label actually did.
Why one model becomes many variants
The proliferation is not arbitrary. When a model's weights are open, every provider that hosts it starts from the same artifact, so there is nothing to compete on except how cheaply and quickly they can serve it. That pressure pushes providers toward quantization and serving optimizations, the two levers that lower cost and latency without needing different weights. The same release therefore appears across providers in a spread of precisions and serving configurations.
At the same time, the original creators and the open community keep producing genuinely new models from that base, through fine-tuning for specific tasks and distillation into smaller footprints. These are additions to the family rather than cheaper ways to serve the original. The result is that a single released model fans out into many served artifacts, all discoverable under variations of one name, even though they span all three of the tiers below. The names imply a set of equivalents, and the reality is anything but.
The one question that organizes everything
The cleanest way to make sense of model variants is to stop sorting them by their suffix and sort them by their effect instead. Every modification falls into one of three tiers, defined by a single question: does it change the model's measured behavior, and does the original benchmark still describe the result?
In the first tier, nothing about the model changes; only the execution does. In the second, the weights are compressed but still the same weights, so the benchmark mostly carries over with some risk. In the third, the modification produces a genuinely different model, and the original benchmark no longer applies at all. Once a variant is placed in the right tier, most of the confusion about what to expect from it disappears.
Serving optimizations: the same model, run faster
The first tier covers techniques that change how a model is served without touching its weights. Speculative decoding, continuous batching, optimized attention kernels, and similar infrastructure work all fall here. The model computes the same function over the same weights, so outputs match the original apart from small floating-point differences between kernels, and a correctly implemented speculative decoder preserves the target model's output distribution. The benefit is entirely in latency, throughput, and cost.
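To see why a technique like speculative decoding leaves the output alone, a toy version helps. The sketch below is a minimal illustration under loud assumptions: `target_model` and `draft_model` are hypothetical deterministic functions standing in for a large and a small model, decoding is greedy, and verification is done one position at a time rather than in the single batched pass a real server would use.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic toy rule, not a real LLM."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except when
    the prefix length is a multiple of 3, where it guesses wrong."""
    guess = target_model(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50

def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference path: the target model generates every token itself."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; target keeps the agreeing prefix
    and substitutes its own token at the first disagreement."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                out.extend(proposal[:i])   # accepted draft tokens
                out.append(expected)       # target's correction
                break
        else:
            out.extend(proposal)           # every draft token was accepted
    return out

prompt = [1, 2, 3]
reference = greedy_decode(target_model, prompt, 20)
fast_path = speculative_decode(target_model, draft_model, prompt, 20)
```

In this toy, `fast_path` and `reference` are the same sequence even though the draft model is wrong on a third of its guesses, because every emitted token is one the target model itself would have chosen. With sampling instead of greedy decoding, the acceptance rule is probabilistic and preserves the target's distribution rather than an exact token sequence.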
Quality in this tier is identical to the original by construction, which makes these optimizations the closest thing to a free lunch in inference. There is one caveat worth keeping in mind. Vendor labels such as Turbo and Lite are branding, not technical specifications, and a Turbo endpoint sometimes denotes a quantized build rather than a pure serving optimization. The label tells you the provider has done something to make the endpoint faster or cheaper; it does not tell you whether the weights were left intact. To know which tier you are actually in, you have to look past the marketing name to the precision the endpoint serves.
Quantization: the same weights, compressed
The second tier is quantization, which stores a model's weights at lower numerical precision than the format they were trained in. Models are typically trained and served in 16-bit formats such as BF16, and quantization re-encodes those weights into something smaller: FP8 (8-bit floating point), INT8 (8-bit integer), or 4-bit formats produced by methods like AWQ (Activation-aware Weight Quantization) and GPTQ. The widely used GGUF format packages quantized weights at a range of bit widths and is common for local and CPU-bound inference.
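The memory saving is simple arithmetic. The sketch below computes weights-only storage for a hypothetical 8-billion-parameter model; the parameter count is an assumption chosen for round numbers, and the figures ignore quantization scales, activations, and the KV cache, all of which add to the real footprint.

```python
# Bits used to store one weight in each format (weights only).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "INT8": 8, "4-bit": 4}

def weight_gigabytes(n_params: int, bits: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits / 8 / 1e9

N_PARAMS = 8_000_000_000  # hypothetical model size, for illustration only

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gigabytes(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take 16 GB at BF16, 8 GB at either 8-bit format, and 4 GB at 4 bits, which is the whole economic case for quantization in three numbers.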
The key property of quantization is that the weights are still the same weights, only approximated. This is why the original model's benchmark mostly transfers. Eight-bit formats are close to lossless on most tasks, with differences that frequently sit inside the noise of an evaluation run. Four-bit methods like AWQ and GPTQ were designed specifically to keep that loss small, and on their published evaluations they come close to the full-precision baseline.
The honest qualifier is that "mostly transfers" is not "always transfers." The loss grows as the bit width drops, and it is uneven across tasks. Reasoning, long-context retrieval, and code generation tend to be more sensitive to aggressive quantization than short factual questions, and very low bit widths degrade more visibly. Foundational results such as LLM.int8() showed that careful 8-bit quantization can preserve quality at scale, but the practical implication for a buyer is unchanged: a quantized endpoint inherits the original's benchmark as a strong prior, not a guarantee, and a workload that is sensitive to precision deserves its own check.
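The relationship between bit width and error can be seen on synthetic data. The sketch below uses plain symmetric round-to-nearest quantization with one scale for the whole tensor, which is deliberately simpler than what AWQ or GPTQ actually do, and the weights are random numbers drawn from an assumed Gaussian, not taken from any real model.

```python
import random

def quantize_dequantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization, then dequantize.
    Returns the approximated weights a quantized model would compute with."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)                    # all-zero tensor: nothing to do
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)                   # back to float
    return restored

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic weights (assumption): zero-mean Gaussian, standard deviation 0.02.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

err8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
```

The 4-bit error comes out well above the 8-bit error because the same range is covered by 15 levels instead of 255. Per-weight error is only a proxy, though: how much of it surfaces as a quality drop depends on the task, which is why the check has to be run on the workload itself.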
Adapters, fine-tunes, and distills: a different model wearing the name
The third tier is where the original benchmark stops applying. Each of these modifications produces a different model, even when the name suggests otherwise, so a score measured on the base says little about the result.
LoRA and QLoRA
LoRA, short for Low-Rank Adaptation, freezes the base model's weights and trains a small set of additional low-rank matrices that are applied on top. It is a cheap and portable way to adapt a model, since the result is a compact adapter rather than a full set of new weights. QLoRA extends the idea by training the adapter on top of a 4-bit quantized base, which makes fine-tuning large models feasible on modest hardware.
An adapter can be merged into the base weights, at which point it behaves like a fine-tune, or served as a separate layer on top of a shared base. The distinction that matters for selection is simpler than the mechanics: an adapted model is no longer the base model, so the base model's benchmark does not describe it. A hosted "-LoRA" endpoint is a particularly easy place to be misled, because it can look like the base with a small modifier when it is in fact a different model entirely.
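The two serving modes are the same arithmetic arranged differently. The sketch below follows the usual LoRA formulation, in which the update is the product of two low-rank matrices scaled by `alpha / r`; the matrices, the input, and the values of `alpha` and `r` are tiny made-up numbers chosen so the result can be checked by hand.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """Adapter served separately: y = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def merge_lora(W, A, B, alpha, r):
    """Adapter merged into the weights: W' = W + (alpha / r) * B A."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Toy values (assumptions): a frozen 3x4 base matrix and a rank-2 adapter.
W = [[1.0, 0.0, 2.0, -1.0],
     [0.0, 3.0, 1.0, 0.0],
     [2.0, 1.0, 0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # 3 x r
A = [[0.5, 0.0, 0.0, 1.0], [0.0, -1.0, 0.5, 0.0]]  # r x 4
x = [1.0, 2.0, 3.0, 4.0]
alpha, r = 4.0, 2

base_out = matvec(W, x)
separate_out = lora_forward(W, A, B, x, alpha, r)
merged_out = matvec(merge_lora(W, A, B, alpha, r), x)
```

Here `separate_out` and `merged_out` agree, and both differ from `base_out`, which is the point: however the adapter is deployed, the model that answers is not the base. The parameter saving only appears at realistic sizes. For a hypothetical 4096 by 4096 layer with rank 16, the adapter holds 16 × (4096 + 4096) = 131,072 values against 16,777,216 in the full matrix, under 1 percent.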
Fine-tuning
Fine-tuning continues training a base model on new data and updates its weights directly. The instruction-tuned and chat-tuned models that dominate production are themselves fine-tunes of a pretrained base, which is why an "-Instruct" model behaves so differently from the raw base it came from. A domain fine-tune goes further, reshaping the model around a specific task or style.
Because fine-tuning changes the weights, it changes behavior in ways the base benchmark cannot predict. A fine-tune can be better than the base on its target task and worse on everything else, and only an evaluation on the relevant workload will tell you which.
Distillation
Distillation trains a smaller student model to imitate the outputs of a larger teacher. The student is a separate model with its own architecture, and despite the naming convention it is not a compressed copy of the teacher. This is the single most misread label in the current ecosystem.
The clearest example comes from the DeepSeek-R1 release. A model named DeepSeek-R1-Distill-Llama-8B is a Llama 8B base fine-tuned on reasoning traces generated by R1. It carries the R1 brand in its name, but it has Llama's architecture, Llama's size, and a quality profile of its own. Reading its capability off R1's benchmark would be a category error. A distilled model has to be evaluated as the distinct model it is.
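The R1 distills learn from generated text. Another widely used formulation of "imitate the teacher" works on probabilities instead, and it shows the same thing from a different angle: the student only ever sees the teacher's outputs, never its weights. The sketch below is a minimal version of that soft-label objective, a KL divergence between temperature-softened next-token distributions. The logits are invented for illustration, and nothing here is claimed to be the recipe used for any particular released model.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions.
    Zero when the student matches the teacher exactly, positive otherwise."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits over a 4-token vocabulary (assumption, for illustration).
teacher = [4.0, 1.0, 0.5, -2.0]
student_far = [0.0, 3.0, 0.5, 1.0]
student_near = [3.8, 1.1, 0.4, -1.9]

loss_far = distillation_loss(teacher, student_far)
loss_near = distillation_loss(teacher, student_near)
```

Training pushes this loss down, so the student's outputs move toward the teacher's, but the student's architecture and parameter count are whatever its own base model provides.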
A field guide to the labels
The table below places the common labels in their tier and answers the one question that matters for selection.
| Label | Tier | What it changes | Original benchmark still valid? |
|---|---|---|---|
| Turbo, Lite, Fast | Serving optimization (sometimes a hidden quantization) | How the model is run, not the weights, unless the label hides a quant | Yes, if it is a true serving optimization |
| FP8, INT8 | Quantization | Weight precision, same weights | Mostly, close to lossless |
| AWQ, GPTQ, INT4, GGUF (4-bit) | Quantization | Weight precision, same weights | Mostly, with more task-dependent risk |
| LoRA, QLoRA | Adapter | Adds trained parameters on a frozen base | No, it is an adapted model |
| Instruct, Chat | Fine-tune | Post-trained weights | No, different from the raw base |
| Distill | Distillation | A separate student model | No, a different model entirely |
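
For anyone triaging a long model list, the table can be encoded as a lookup. The sketch below is one possible encoding, with hypothetical names throughout (`Tier`, `TIER_BY_LABEL`, `tier_for_label`). It deliberately returns `None` for branding labels such as Turbo, since those cannot be placed in a tier from the name alone.

```python
from enum import Enum
from typing import Optional

class Tier(Enum):
    SERVING = "benchmark carries over"
    QUANTIZED = "benchmark mostly carries over"
    DIFFERENT_MODEL = "benchmark does not apply"

# Labels whose technique is unambiguous (mirrors the table above).
TIER_BY_LABEL = {
    "FP8": Tier.QUANTIZED,
    "INT8": Tier.QUANTIZED,
    "AWQ": Tier.QUANTIZED,
    "GPTQ": Tier.QUANTIZED,
    "INT4": Tier.QUANTIZED,
    "GGUF": Tier.QUANTIZED,
    "LORA": Tier.DIFFERENT_MODEL,
    "QLORA": Tier.DIFFERENT_MODEL,
    "INSTRUCT": Tier.DIFFERENT_MODEL,
    "CHAT": Tier.DIFFERENT_MODEL,
    "DISTILL": Tier.DIFFERENT_MODEL,
}

# Branding labels: may be a pure serving optimization or a relabeled quant.
AMBIGUOUS_LABELS = {"TURBO", "LITE", "FAST"}

def tier_for_label(label: str) -> Optional[Tier]:
    """Return the tier a label implies, or None when the label alone
    cannot settle it and the served precision must be checked."""
    key = label.strip().upper()
    if key in AMBIGUOUS_LABELS:
        return None
    return TIER_BY_LABEL.get(key)

for label in ["FP8", "Turbo", "Distill"]:
    tier = tier_for_label(label)
    print(label, "->", tier.value if tier else "check the served precision")
```

A `None` result is a prompt to look up the endpoint's actual precision, not a verdict. The lookup is a triage aid for reading names, and the technique underneath still decides the tier.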
So which one should you use?
Start from what the variant does to quality, because that determines how much you can trust the published score before you test. If you need exact fidelity to the published model, serve it at full precision and accept the higher cost. If you want it cheaper and faster with low risk, an 8-bit quantized build is the usual sweet spot, and a quick validation is worth the time mainly when your workload leans on reasoning, long context, or code. Four-bit builds offer larger savings and deserve a real evaluation rather than an assumption, since this is where the quality gap starts to matter.
Serving optimizations in the first tier are the rare case where you can take the saving without a quality trade, provided you have confirmed the endpoint is genuinely a serving optimization and not a relabeled quant. The third tier is different in kind. An adapter, a fine-tune, or a distill should be chosen on its own evaluation, never on the base model's leaderboard number, because the name is the only thing it shares with the original.
There is a second decision hiding behind the first. Even once you have chosen the artifact you want, providers frequently serve a quantized or otherwise modified version under the base name, so the variant you select and the variant you are billed for are not always the same. Confirming the actual precision and lineage of an endpoint is part of choosing it, and it is the kind of detail our model catalog is built to make visible. For a deeper look at how to read the scores these variants are marketed with, our guide on LLM benchmarks covers why a headline number is often less informative than it appears.
Common misconceptions worth dropping
A few stubborn ideas cause most of the confusion, and naming them directly helps.
- Quantization does not teach a model anything new or change what it knows. It re-encodes the existing weights at lower precision.
- A LoRA adapter is neither a quantization nor a smaller model. It is extra parameters trained on top of a frozen base.
- A distilled model is not a shrunken version of its teacher. It is a different model trained to imitate the teacher's outputs.
- A serving optimization and a quantization are not the same trade. The first preserves quality exactly; the second exchanges a little quality for efficiency.
- The suffix is marketing, not a specification. The technique underneath is what places a variant in its tier, and that is what determines whether the benchmark you are relying on still applies.
The reliable habit that follows from all of this is to identify the tier before trusting the score. A variant in the first tier carries the original's reputation intact, a variant in the second carries it with a known risk, and a variant in the third carries only the name. Knowing which one you are looking at is most of the work of choosing well. When the stakes are high, the final step is the same one that has always separated good model selection from guesswork, which is to evaluate the specific build on your own task. Our guide on choosing the right model walks through how to structure that evaluation.
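In its simplest form, evaluating the specific build means running the same task cases through the reference model and the candidate variant and comparing scores. The sketch below assumes exact-match scoring and treats each endpoint as a plain function from prompt to answer. `baseline` and `candidate` are hypothetical stubs standing in for real API calls, and the 2-point tolerance is an arbitrary placeholder rather than a recommendation.

```python
from typing import Callable, Dict, List, Tuple

Generate = Callable[[str], str]

def accuracy(generate: Generate, cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer exactly matches the expected one."""
    hits = sum(1 for prompt, expected in cases if generate(prompt).strip() == expected)
    return hits / len(cases)

def compare(baseline: Generate, candidate: Generate,
            cases: List[Tuple[str, str]], tolerance: float = 0.02) -> Dict[str, object]:
    """Score both endpoints on the same cases and flag a drop beyond the tolerance."""
    base = accuracy(baseline, cases)
    cand = accuracy(candidate, cases)
    return {"baseline": base, "candidate": cand,
            "within_tolerance": (base - cand) <= tolerance}

# Stand-in task cases and endpoints (all hypothetical).
cases = [("2+2=", "4"), ("Capital of France?", "Paris"),
         ("3*3=", "9"), ("Opposite of hot?", "cold")]
baseline_answers = dict(cases)
candidate_answers = dict(cases)
candidate_answers["3*3="] = "6"   # the variant gets one case wrong

def baseline(prompt: str) -> str:
    return baseline_answers[prompt]

def candidate(prompt: str) -> str:
    return candidate_answers[prompt]

report = compare(baseline, candidate, cases)
```

With these stubs the baseline scores 1.0 and the candidate 0.75, so `within_tolerance` is false. A real evaluation would use enough cases for the difference to be meaningful and a scoring rule suited to the task, since exact match is too strict for most generative work.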