Model Quantization in 2026: 4-bit, 8-bit, and the Tradeoffs
A 70-billion-parameter model in 16-bit weights wants ~140 GB of GPU memory. That is two 80 GB A100s or two 80 GB H100s. In 4-bit weights it wants ~40 GB. That fits on one 48 GB card such as an L40S. Quantization is the difference between "we need an expensive cluster" and "we can run this on hardware we already have." Here is what 2026 has settled on.
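The arithmetic behind those numbers is just parameters times bits per weight. A minimal sketch, assuming decimal gigabytes and counting weights only; `weight_memory_gb` is a hypothetical helper name:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)
int4_gb = weight_memory_gb(params_70b, 4)

print(f"FP16 weights: {fp16_gb:.0f} GB")   # 140 GB
print(f"INT4 weights: {int4_gb:.0f} GB")   # 35 GB
```

The raw 4-bit figure comes out at 35 GB. The gap up to the ~40 GB quoted above is plausibly quantization metadata (scales, zero points) plus runtime overhead such as the KV cache, which this sketch does not model.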
What quantization actually does
Modern LLMs store weights and activations as floating-point numbers — typically FP16 or BF16. Quantization replaces those with lower-precision integers (INT8, INT4) or smaller floats (FP8), reducing both storage and compute at the cost of some numerical precision.
The interesting part: most LLM weights live in a narrow distribution, and most predictions are robust to small perturbations. So well-designed quantization retains most of the quality while shrinking models 2–4x relative to 16-bit (2x at 8-bit, 4x at 4-bit).
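To make the idea concrete, here is a minimal sketch of the simplest scheme, round-to-nearest symmetric quantization with a single scale. The function names and the toy weight values are illustrative assumptions; production methods use per-group scales and smarter rounding.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.07, 0.31, -0.25, 0.02, 0.18]
for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: codes={codes} max_abs_error={max_err:.4f}")
```

Each weight lands within half a quantization step of its original value, so the error shrinks as the bit width grows. That bound is why a narrow weight distribution quantizes well: a small maximum magnitude means a small step.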
The four formats that matter in 2026
GPTQ
The first 4-bit method to demonstrate that you could compress LLMs down to INT4 with minimal quality loss. Quantizes weights layer by layer using a calibration dataset. (PremAI guide)
AWQ (Activation-aware Weight Quantization)
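MIT's approach. Instead of treating all weights equally, AWQ uses activation statistics to identify the ~1% of weight channels that matter most ("salient weights") and protects them by scaling those channels before quantization rather than keeping them in higher precision.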
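A toy illustration of why scaling helps, not the AWQ algorithm itself: the weights, the activation vector, and the hand-picked scale of 2.0 are all assumptions, whereas real AWQ searches for the scales on calibration data and quantizes in groups. The sketch multiplies the salient weight by `s` and divides its activation by `s`, which leaves the exact output unchanged but changes what rounding does to it.

```python
def quantize_dequantize(weights, bits=4):
    """Round-to-nearest symmetric quantization, returned in floating point."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical single output row and one calibration activation vector.
weights = [0.05, 0.9, -0.6, 0.3]
activations = [20.0, 1.0, 1.0, 1.0]   # channel 0 has a large activation magnitude

# Per-input-channel scales: only the salient channel is scaled up.
channel_scales = [2.0, 1.0, 1.0, 1.0]

reference = dot(weights, activations)

# Baseline: quantize the weights as they are.
plain_error = abs(dot(quantize_dequantize(weights), activations) - reference)

# Scaled: multiply weights by s before quantizing, divide activations by s.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]
scaled_error = abs(
    dot(quantize_dequantize(scaled_weights), scaled_activations) - reference
)

print(f"plain INT4 error:  {plain_error:.3f}")
print(f"scaled INT4 error: {scaled_error:.3f}")
```

In this example the small weight on the salient channel rounds to zero in the baseline, and the large activation turns that into a large output error. Scaling it up lets it survive rounding, so the output error drops.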
GGUF
A file format rather than a quantization algorithm. GGUF is the storage format for llama.cpp, which runs models on CPU, Mac unified memory, and consumer GPUs. The quantization underneath is typically one of the K-quants (Q4_K_M is the popular sweet spot).
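The core idea behind block-based formats can be sketched as follows. This is a simplified illustration under assumed parameters (blocks of 32 weights, one 16-bit scale per block, synthetic weights), not the actual K-quant layout, which is more elaborate.

```python
def quantize_blockwise(weights, bits=4, block_size=32):
    """Quantize in fixed-size blocks, each with its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        restored.extend(round(w / scale) * scale for w in block)
    return restored


def effective_bits_per_weight(bits, block_size, scale_bits=16):
    """Storage cost once the per-block scale is amortized over the block."""
    return bits + scale_bits / block_size


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# 64 small synthetic weights plus one outlier in the first block.
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[5] = 1.0

one_scale = quantize_blockwise(weights, block_size=len(weights))
per_block = quantize_blockwise(weights, block_size=32)

print(f"single scale error: {mean_abs_error(weights, one_scale):.5f}")
print(f"per-block error:    {mean_abs_error(weights, per_block):.5f}")
print(f"bits per weight:    {effective_bits_per_weight(4, 32)}")   # 4.5
```

With one scale for everything, the outlier forces a coarse step on all 64 weights. With per-block scales it only hurts its own block, at the cost of half a bit per weight of extra storage for the scales.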
bitsandbytes (NF4)
The 4-bit format used by QLoRA. Designed for training (fine-tuning) more than for serving.
Quality at 4-bit: how much do you actually lose?
Aggregating across published benchmarks (perplexity on WikiText, MMLU, HumanEval) for 7B–70B models: [Inference]
| Format | Approx. quality vs FP16 | Speed |
|---|---|---|
| AWQ (4-bit) | ~95–99% | Fast (Marlin kernel) |
| GPTQ (4-bit) | ~90–95% | Fast |
| GGUF Q4_K_M | ~92–94% | CPU/Mac native |
| bitsandbytes NF4 | ~93–96% | Slower at serving |
| INT8 | ~98–99%+ | Faster than FP16 |
| FP8 (Hopper/Blackwell) | ~99%+ | Native HW support |
The picture: at 4-bit you generally lose 1–10% on quality benchmarks, with the exact number depending on method, model, and task. For most application workloads (RAG, agents, classification) the loss is below the noise floor of your eval. For high-stakes reasoning, run your own eval before committing.
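One way to check whether a quality drop sits below the noise floor is to compare it against the sampling error of the eval itself. A minimal sketch, assuming an accuracy-style metric, independent examples, and a two-standard-error threshold; all of these are assumptions, and the prediction lists are synthetic.

```python
import math


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)


def compare(baseline_preds, quantized_preds, labels):
    """Report how much accuracy the quantized model retains and whether
    the drop is within a rough noise band of the baseline measurement."""
    n = len(labels)
    base_acc = accuracy(baseline_preds, labels)
    quant_acc = accuracy(quantized_preds, labels)
    drop = base_acc - quant_acc
    # Rough noise floor: two standard errors of the baseline accuracy.
    noise = 2 * standard_error(base_acc, n)
    return {
        "baseline": base_acc,
        "quantized": quant_acc,
        "retained": quant_acc / base_acc if base_acc else float("nan"),
        "drop": drop,
        "within_noise": drop <= noise,
    }


# Synthetic example: 200 items, baseline gets 160 right, quantized gets 156.
labels = [1] * 200
baseline_preds = [1] * 160 + [0] * 40
quantized_preds = [1] * 156 + [0] * 44

result = compare(baseline_preds, quantized_preds, labels)
print(result)
```

On 200 examples a two-point drop is well inside the noise band, which is the situation the paragraph above describes. A paired test on the same examples would be tighter than this unpaired estimate, and a small eval set can hide a real regression just as easily as it can invent one.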
FP8: the new middle ground
NVIDIA's Hopper (H100/H200) and Blackwell (B100/B200) GPUs support FP8 natively. Where AWQ/GPTQ are post-hoc compression, FP8 is first-class hardware support. Quality is essentially indistinguishable from FP16 on most benchmarks, and you get ~2x throughput improvement on supported hardware. [Inference]
In 2026, if you are on H100/H200/B200, FP8 is the default for inference. INT4 is reserved for the situations where you cannot afford the memory budget for FP8.
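That rule of thumb can be written down as a small decision helper. This is a sketch under stated assumptions: `pick_format` and `FP8_GPUS` are hypothetical names, the GPU set only mirrors the cards named in this article and is not exhaustive, and the 80% headroom figure is an arbitrary placeholder for whatever your KV cache and activations actually need.

```python
FP8_GPUS = {"H100", "H200", "B100", "B200"}   # only the GPUs named in the text


def weight_gb(num_params, bits):
    """Weight memory in decimal gigabytes."""
    return num_params * bits / 8 / 1e9


def pick_format(gpu, gpu_memory_gb, num_params, headroom=0.8):
    """Pick a serving format; headroom is the assumed fraction usable for weights."""
    budget = gpu_memory_gb * headroom
    # Prefer FP8 when the hardware supports it and the weights fit.
    if gpu in FP8_GPUS and weight_gb(num_params, 8) <= budget:
        return "FP8"
    # Fall back to 4-bit when memory is the binding constraint.
    if weight_gb(num_params, 4) <= budget:
        return "INT4"
    return "does not fit on one GPU"


print(pick_format("H100", 80, 8e9))     # FP8
print(pick_format("H100", 80, 70e9))    # INT4
print(pick_format("A100", 80, 70e9))    # INT4
print(pick_format("A100", 80, 405e9))   # does not fit on one GPU
```

The 70B case on a single 80 GB H100 shows the fallback: FP8 weights alone would take 70 GB, which leaves too little room under the assumed headroom, so the helper drops to INT4.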
When quantization makes sense
When not to quantize
A pragmatic 2026 recipe
For most teams self-hosting an open model:
llama.cpp makes sense.
Bottom line
In 2026, quantization is not a research project; it is a serving default. At 4-bit, AWQ typically runs open-weight models at less than 5% quality cost and GPTQ at up to roughly 10% (see the table above), while roughly doubling effective hardware capacity. FP8 on modern NVIDIA chips makes the question "should I quantize?" easier still: yes, by default, with INT4 reserved for the memory-tight cases.


