Frequently asked questions

Does quantization make a model worse?

Usually only slightly, and the amount depends on how aggressive the quantization is. Eight-bit formats like FP8 and INT8 are close to lossless on most tasks, with differences that often fall within evaluation noise. Four-bit methods like AWQ and GPTQ are designed to keep the loss small, but the gap grows on harder workloads, and very low bit widths degrade more noticeably. The loss is also task-dependent: reasoning, long-context retrieval, and code generation tend to be more sensitive than short factual answers. The practical takeaway is that quantization keeps the same weights in an approximated, compressed form, so the original benchmark mostly transfers. It is not guaranteed to, and the only way to be sure for your workload is to evaluate the specific quantized build you are served.

Is a distilled model a smaller version of the original?

No. A distilled model is a separate model trained to imitate the outputs of a larger teacher model, and it usually has a different architecture and its own quality profile. The naming is the most common source of confusion. For example, DeepSeek-R1-Distill-Llama-8B is a Llama 8B model fine-tuned on reasoning traces generated by DeepSeek-R1; it is not a compressed copy of R1. It carries the teacher's brand in its name but inherits the student's size and behavior, so the teacher's benchmark does not describe it. A distilled model must be evaluated on its own.

What is the difference between LoRA and fine-tuning?

Both adapt a model to new data, but they differ in what they change. Fine-tuning updates the model's weights directly through continued training. LoRA (Low-Rank Adaptation) freezes the base weights and trains a small set of additional low-rank matrices that are applied on top, which is far cheaper and produces a portable adapter rather than a full new model. A LoRA adapter can be merged into the base weights, at which point it behaves like a fine-tune, or served separately on top of a shared base. Either way, an adapted model is a different artifact from the base, so the base model's benchmark no longer applies to it.

What does a Turbo or FP8 label mean on a hosted model?

FP8 is a specific technique: the weights are stored in an 8-bit floating-point format to save memory and increase speed. They are still the same weights, only approximated at lower precision. Turbo, Lite, and similar labels are vendor branding rather than precise specifications. They often denote a quantized build (FP8 is common) or a serving optimization, but the label alone does not tell you which. Because the label is marketing and not a spec, the reliable approach is to find the actual precision the endpoint serves before assuming the original model's quality applies.

Quantized, Distilled, or Fine-Tuned: What the Labels Mean

Browse any inference provider's model list and you will find the same base names wearing a parade of suffixes: FP8, Turbo, AWQ, LoRA, Distill, Instruct. They are usually presented as a menu of interchangeable options, as if each were simply a cheaper or faster way to get the same thing. That presentation hides a distinction that matters a great deal in practice, because these labels describe operations that are not remotely equivalent to one another.

Some of them leave a model's weights exactly as they were and only change how the computation runs. Some compress the weights, trading a little accuracy for memory and speed. And some produce an entirely new model that merely borrows a famous name. The reason this is worth getting right is that a published benchmark is measured on one specific artifact, and whether that score still applies to the version you are about to call depends entirely on which of these things the label actually did.

Why one model becomes many variants

The proliferation is not arbitrary. When a model's weights are open, every provider that hosts it starts from the same artifact, so there is nothing to compete on except how cheaply and quickly they can serve it. That pressure pushes providers toward quantization and serving optimizations, the two levers that lower cost and latency without needing different weights. The same release therefore appears across providers in a spread of precisions and serving configurations.

At the same time, the original creators and the open community keep producing genuinely new models from that base, through fine-tuning for specific tasks and distillation into smaller footprints. These are additions to the family rather than cheaper ways to serve the original. The result is that a single released model fans out into many served artifacts, all discoverable under variations of one name, even though they span all three of the tiers below. The names imply a set of equivalents, and the reality is anything but.

The one question that organizes everything

The cleanest way to make sense of model variants is to stop sorting them by their suffix and sort them by their effect instead. Every modification falls into one of three tiers, defined by a single question: once the modification has been applied, does the original benchmark still describe the model's measured behavior?

In the first tier, nothing about the model changes; only the execution does. In the second, the weights are compressed but still the same weights, so the benchmark mostly carries over with some risk. In the third, the modification produces a genuinely different model, and the original benchmark no longer applies at all. Once a variant is placed in the right tier, most of the confusion about what to expect from it disappears.

Figure: Three tiers of model variants, sorted by what each label does to the model and whether the original benchmark still applies: serving optimization (yes), quantization (mostly), and a different model such as a LoRA, fine-tune, or distill (no).

Serving optimizations: the same model, run faster

The first tier covers techniques that change how a model is served without touching its weights. Speculative decoding, continuous batching, optimized attention kernels, and similar infrastructure work all fall here. The forward pass computes the same function over the same weights, so the outputs match the original apart from floating-point rounding differences between kernels and batch sizes. The benefit is entirely in latency, throughput, and cost.
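To make the "same model, run faster" claim concrete, here is a minimal sketch of speculative decoding under greedy decoding. The two "models" are hypothetical toy functions invented for illustration, not real networks, and the verification loop stands in for what a real system would do in one batched forward pass. Sampling-based variants use a rejection step that preserves the output distribution rather than the exact tokens, which this sketch does not cover.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

VOCAB = 11  # toy vocabulary size (assumption for illustration)


def target_next(tokens: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic function of the context."""
    return (sum(tokens) * 7 + len(tokens)) % VOCAB


def draft_next(tokens: List[int]) -> int:
    """Hypothetical 'small' model: wrong whenever the context length is a multiple of 3."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % VOCAB


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; target keeps the prefix it agrees with."""
    tokens = list(prompt)
    goal = len(prompt) + n
    while len(tokens) < goal:
        # 1. The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix, then substitute the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token matched the target.
            tokens.extend(proposal)
    return tokens


prompt = [1, 2, 3]
plain = greedy_decode(target_next, prompt, 20)
speculative = speculative_decode(target_next, draft_next, prompt, 20)
print(plain == speculative)
```

Every token that survives is one the target model would have produced itself, so the draft model affects speed only. That is the property that keeps this technique in the first tier.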

Quality in this tier is identical to the original by construction, which makes these optimizations the closest thing to a free lunch in inference. There is one caveat worth keeping in mind. Vendor labels such as Turbo and Lite are branding, not technical specifications, and a Turbo endpoint sometimes denotes a quantized build rather than a pure serving optimization. The label tells you the provider has done something to make the endpoint faster or cheaper; it does not tell you whether the weights were left intact. To know which tier you are actually in, you have to look past the marketing name to the precision the endpoint serves.

Quantization: the same weights, compressed

The second tier is quantization, which stores a model's weights at lower numerical precision than the format they were trained in. Models are typically trained and served in 16-bit formats such as BF16, and quantization re-encodes those weights into something smaller: FP8 (8-bit floating point), INT8 (8-bit integer), or 4-bit formats produced by methods like AWQ (Activation-aware Weight Quantization) and GPTQ. The widely used GGUF format packages quantized weights at a range of bit widths and is common for local and CPU-bound inference.
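The memory saving is simple arithmetic, sketched below for a hypothetical 70-billion-parameter model. The estimate counts weight storage only. It ignores the small overhead of quantization scales, as well as the KV cache and activations, so treat the figures as rough lower bounds rather than sizing guidance.

```python
# Bits used to store each weight in the formats named above.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


N_PARAMS = 70e9  # hypothetical model size, chosen for illustration

for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt:>5}: {weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

For this example the weights come to about 140 GB at BF16, 70 GB at 8 bits, and 35 GB at 4 bits, which is why halving the precision can change how many accelerators an endpoint needs.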

The key property of quantization is that the weights are still the same weights, only approximated. This is why the original model's benchmark mostly transfers. Eight-bit formats are close to lossless on most tasks, with differences that frequently sit inside the noise of an evaluation run. Four-bit methods like AWQ and GPTQ were designed specifically to keep that loss small, and on their published evaluations they come close to the full-precision baseline.

The honest qualifier is that "mostly transfers" is not "always transfers." The loss grows as the bit width drops, and it is uneven across tasks. Reasoning, long-context retrieval, and code generation tend to be more sensitive to aggressive quantization than short factual questions, and very low bit widths degrade more visibly. Foundational results such as LLM.int8() showed that careful 8-bit quantization can preserve quality at scale, but the practical implication for a buyer is unchanged: a quantized endpoint inherits the original's benchmark as a strong prior, not a guarantee, and a workload that is sensitive to precision deserves its own check.
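The way error grows as the bit width drops can be seen in a toy round trip. The sketch below uses plain symmetric round-to-nearest integer quantization with one scale for the whole tensor. That is a deliberate simplification: it is not AWQ, GPTQ, or FP8, and the weights are random numbers rather than a real model. It measures error in the weights, not task quality, so it illustrates the trend only.

```python
import random


def quantize_dequantize(weights, bits):
    """Round each weight to a signed integer grid, then map it back to a float."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / qmax
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)                   # approximate original value
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for one weight tensor (assumption for illustration).
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit mean absolute error: {err:.6f}")
```

The restored values are still recognizably the original weights, which is why the benchmark mostly transfers, but the approximation gets coarser at each step down in precision.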

Adapters, fine-tunes, and distills: a different model wearing the name

The third tier is where the original benchmark stops applying. Each of these modifications produces a different model, even when the name suggests otherwise, so a score measured on the base says little about the result.

LoRA and QLoRA

LoRA, short for Low-Rank Adaptation, freezes the base model's weights and trains a small set of additional low-rank matrices that are applied on top. It is a cheap and portable way to adapt a model, since the result is a compact adapter rather than a full set of new weights. QLoRA extends the idea by training the adapter on top of a 4-bit quantized base, which makes fine-tuning large models feasible on modest hardware.

An adapter can be merged into the base weights, at which point it behaves like a fine-tune, or served as a separate layer on top of a shared base. The distinction that matters for selection is simpler than the mechanics: an adapted model is no longer the base model, so the base model's benchmark does not describe it. A hosted "-LoRA" endpoint is a particularly easy place to be misled, because it can look like the base with a small modifier when it is in fact a different model entirely.
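The two serving options described above can be sketched with small matrices. This example assumes the usual LoRA formulation, in which the adapted weight is W + (alpha / r) · B · A. The dimensions and the random values standing in for trained adapter matrices are made up for illustration.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4   # illustrative sizes, not from any real model

W = rand_matrix(d_out, d_in, rng)   # frozen base weight
B = rand_matrix(d_out, rank, rng)   # adapter matrix (stand-in for trained values)
A = rand_matrix(rank, d_in, rng)    # adapter matrix (stand-in for trained values)
x = rand_matrix(d_in, 1, rng)       # one input column vector

# Option 1: serve the adapter separately on top of the shared base.
y_separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / rank))

# Option 2: merge the adapter into the base weights once, then serve as usual.
W_merged = add(W, scale(matmul(B, A), alpha / rank))
y_merged = matmul(W_merged, x)

y_base = matmul(W, x)

base_params = d_out * d_in
adapter_params = rank * (d_out + d_in)
print(f"adapter adds {adapter_params} parameters next to {base_params} base parameters")
```

Both options give the same output up to floating-point rounding, and both differ from the base model's output. The adapter is small, but the model it produces is no longer the one the base benchmark measured.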

Fine-tuning

Fine-tuning continues training a base model on new data and updates its weights directly. The instruction-tuned and chat-tuned models that dominate production are themselves fine-tunes of a pretrained base, which is why an "-Instruct" model behaves so differently from the raw base it came from. A domain fine-tune goes further, reshaping the model around a specific task or style.

Because fine-tuning changes the weights, it changes behavior in ways the base benchmark cannot predict. A fine-tune can be better than the base on its target task and worse on everything else, and only an evaluation on the relevant workload will tell you which.

Distillation

Distillation trains a smaller student model to imitate the outputs of a larger teacher. The student is a separate model with its own architecture, and despite the naming convention it is not a compressed copy of the teacher. This is the single most misread label in the current ecosystem.

The clearest example comes from the DeepSeek-R1 release. A model named DeepSeek-R1-Distill-Llama-8B is a Llama 8B base fine-tuned on reasoning traces generated by R1. It carries the R1 brand in its name, but it has Llama's architecture, Llama's size, and a quality profile of its own. Reading its capability off R1's benchmark would be a category error. A distilled model has to be evaluated as the distinct model it is.
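For intuition about what "imitating the teacher's outputs" means as a training signal, here is a minimal sketch of one classic formulation: a temperature-softened KL divergence between teacher and student token distributions. This is an assumption chosen for illustration, not a description of how any particular release was trained. Distills built by fine-tuning on teacher-generated text, as described above, use an ordinary next-token loss on that text instead. The logits are invented numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Penalize the student for diverging from the teacher's softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return kl_divergence(p, q) * temperature ** 2


teacher = [4.0, 1.0, 0.5]        # made-up logits over a 3-token vocabulary
student_close = [3.8, 1.1, 0.4]
student_far = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, student_close))
print(distillation_loss(teacher, student_far))
```

Either way, the loss only pulls the student toward the teacher's behavior on the training data. The student keeps its own parameters and capacity, which is why its quality has to be measured separately.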

A field guide to the labels

The table below places the common labels in their tier and answers the one question that matters for selection.

Label	Tier	What it changes	Original benchmark still valid?
Turbo, Lite, Fast	Serving optimization (often quantization)	How the model is run, not the weights, unless the label hides a quant	Yes, if it is a true serving optimization
FP8, INT8	Quantization	Weight precision, same weights	Mostly, close to lossless
AWQ, GPTQ, INT4, GGUF (4-bit)	Quantization	Weight precision, same weights	Mostly, with more task-dependent risk
LoRA, QLoRA	Adapter	Adds trained parameters on a frozen base	No, it is an adapted model
Instruct, Chat	Fine-tune	Post-trained weights	No, different from the raw base
Distill	Distillation	A separate student model	No, a different model entirely
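The table can be turned into a first-pass triage helper, sketched below. The function name, the tokenization rule, and the second example model name are assumptions made for illustration. Because suffixes are marketing rather than specifications, a match here is only a hint about which tier to investigate, not a substitute for checking what the endpoint actually serves.

```python
import re
from typing import Optional, Tuple

# label -> (tier number, what the label denotes, original benchmark still valid?)
SERVING = (1, "serving optimization (may hide a quant)", "yes, if the weights are untouched")
QUANT_8 = (2, "quantization", "mostly, close to lossless")
QUANT_4 = (2, "quantization", "mostly, with more task-dependent risk")

LABELS = {
    "turbo": SERVING, "lite": SERVING, "fast": SERVING,
    "fp8": QUANT_8, "int8": QUANT_8,
    "awq": QUANT_4, "gptq": QUANT_4, "int4": QUANT_4, "gguf": QUANT_4,
    "lora": (3, "adapter", "no"),
    "qlora": (3, "adapter", "no"),
    "instruct": (3, "fine-tune", "no, different from the raw base"),
    "chat": (3, "fine-tune", "no, different from the raw base"),
    "distill": (3, "distillation", "no, a different model entirely"),
}


def classify(model_name: str) -> Optional[Tuple[int, str, str]]:
    """Return the highest tier implied by any label in the name, or None if no label matches."""
    tokens = re.split(r"[-_/.\s]+", model_name.lower())
    hits = [LABELS[t] for t in tokens if t in LABELS]
    if not hits:
        return None
    # The highest tier wins: a quantized distill is still a different model.
    return max(hits, key=lambda hit: hit[0])


print(classify("DeepSeek-R1-Distill-Llama-8B"))
print(classify("Llama-3.3-70B-Instruct-FP8"))  # made-up name combining two labels
```

Taking the highest tier reflects the logic of the table: once any label puts a variant in the third tier, the original benchmark no longer applies, whatever else was done to it.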

So which one should you use?

Start from what the variant does to quality, because that determines how much you can trust the published benchmark before you test. If you need exact fidelity to the published model, serve it at full precision and accept the higher cost. If you want it cheaper and faster with low risk, an 8-bit quantized build is the usual sweet spot, and it is worth a quick validation only if your workload leans on reasoning, long context, or code. Four-bit builds offer larger savings and deserve a real evaluation rather than an assumption, since this is where the quality gap starts to matter.

Serving optimizations in the first tier are the rare case where you can take the saving without a quality trade, provided you have confirmed the endpoint is genuinely a serving optimization and not a relabeled quant. The third tier is different in kind. An adapter, a fine-tune, or a distill should be chosen on its own evaluation, never on the base model's leaderboard number, because the name is the only thing it shares with the original.

There is a second decision hiding behind the first. Even once you have chosen the artifact you want, providers frequently serve a quantized or otherwise modified version under the base name, so the variant you select and the variant you are billed for are not always the same. Confirming the actual precision and lineage of an endpoint is part of choosing it, and it is the kind of detail our model catalog is built to make visible. For a deeper look at how to read the scores these variants are marketed with, our guide on LLM benchmarks covers why a headline number is often less informative than it appears.

Common misconceptions worth dropping

A few stubborn ideas cause most of the confusion, and naming them directly helps.

Quantization does not teach a model anything new or change what it knows. It re-encodes the existing weights at lower precision.
A LoRA adapter is neither a quantization nor a smaller model. It is extra parameters trained on top of a frozen base.
A distilled model is not a shrunken version of its teacher. It is a different model trained to imitate the teacher's outputs.
A serving optimization and a quantization are not the same trade. The first preserves quality exactly; the second exchanges a little quality for efficiency.
The suffix is marketing, not a specification. The technique underneath is what places a variant in its tier, and that is what determines whether the benchmark you are relying on still applies.

The reliable habit that follows from all of this is to identify the tier before trusting the score. A variant in the first tier carries the original's reputation intact, a variant in the second carries it with a known risk, and a variant in the third carries only the name. Knowing which one you are looking at is most of the work of choosing well. When the stakes are high, the final step is the same one that has always separated good model selection from guesswork, which is to evaluate the specific build on your own task. Our guide on choosing the right model walks through how to structure that evaluation.
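Evaluating the specific build on your own task can start very small. The sketch below compares two builds on the same labeled examples using exact-match accuracy. The two generate functions are hypothetical stand-ins for calls to a full-precision reference and a candidate variant, and the four examples are toy data. A real check would use your own prompts, a scoring rule suited to the task, and enough examples for the difference to rise above noise.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]            # (prompt, expected answer)
Generate = Callable[[str], str]      # stand-in for a call to a served model


def accuracy(generate: Generate, examples: List[Example]) -> float:
    """Fraction of examples where the model's answer exactly matches the expected one."""
    correct = sum(1 for prompt, expected in examples if generate(prompt).strip() == expected)
    return correct / len(examples)


def compare_builds(reference: Generate, candidate: Generate,
                   examples: List[Example]) -> Dict[str, float]:
    """Score both builds on the same examples and report the gap."""
    ref_acc = accuracy(reference, examples)
    cand_acc = accuracy(candidate, examples)
    return {"reference": ref_acc, "candidate": cand_acc, "delta": cand_acc - ref_acc}


# Toy data and toy "models" (assumptions for illustration only).
examples = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]
reference_answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4"}
candidate_answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "5"}

report = compare_builds(reference_answers.get, candidate_answers.get, examples)
print(report)
```

The useful number is the delta on your workload, not either score in isolation. A variant in the second tier should show a delta near zero, and a variant in the third tier can land anywhere.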
