Model Quantization in 2026: 4-bit, 8-bit, and the Tradeoffs

CallMissed · May 8, 2026 · 6 min read · Guide

Tags: Quantization, LLM, AI Infrastructure, Performance

A 70-billion-parameter model in 16-bit weights needs ~140 GB of GPU memory for the weights alone. That is two 80 GB cards (A100 or H100). In 4-bit weights it needs ~40 GB with quantization overhead, which fits on a single 48 GB card such as an L40S. Quantization is the difference between "we need an expensive cluster" and "we can run this on hardware we already have." Here is what 2026 has settled on.
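
The arithmetic behind those numbers is just parameters times bits per weight. The sketch below is a back-of-the-envelope estimate for the weights only; the function name is illustrative, and it deliberately ignores KV cache, activations, and quantization metadata (which is roughly where the gap between 35 GB and the ~40 GB quoted above comes from).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# 70B parameters at the precisions discussed in this guide.
for label, bits in [("FP16/BF16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{label:>9}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Treat the result as a lower bound when sizing hardware: a real deployment also needs room for the KV cache and batch headroom.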

What quantization actually does

Modern LLMs store weights and activations as floating-point numbers — typically FP16 or BF16. Quantization replaces those with lower-precision integers (INT8, INT4) or smaller floats (FP8), reducing both storage and compute at the cost of some numerical precision.

The interesting part: most LLM weights live in a narrow distribution, and most predictions are robust to small perturbations. So well-designed quantization recovers most quality while shrinking models 2–4x relative to FP16 (2x at 8-bit, 4x at 4-bit).
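
To make the mechanism concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single scale, in plain Python. This is the naive baseline that methods like GPTQ and AWQ improve on, not either of those methods; the function names and sample weights are illustrative assumptions.

```python
def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 if all zeros
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31]    # made-up sample values
for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max error={worst:.4f}")
```

The rounding error per weight is at most half the scale, so halving the bit width from 8 to 4 makes the grid roughly 18x coarser for the same weight range. That is why 4-bit methods need extra tricks to stay accurate.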

The four formats that matter in 2026

GPTQ

The first 4-bit method to demonstrate that you could compress LLMs down to INT4 with minimal quality loss. Quantizes weights layer by layer using a calibration dataset. (PremAI guide)

Strength: mature, broad framework support

Weakness: slightly lower quality than AWQ in most modern benchmarks

AWQ (Activation-aware Weight Quantization)

MIT's approach. Instead of treating all weights equally, AWQ identifies the ~1% of weight channels that see the largest activation magnitudes ("salient weights") and protects them by scaling those channels before quantization.

Strength: typically retains 95%+ of FP16 quality at 4-bit. With the Marlin kernel, AWQ models can hit ~10x speedups over naive FP16 inference. (Jarvis Labs) [Unverified]

Weakness: narrower hardware support than GPTQ historically; 2026 has closed most of that gap.
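
The idea of protecting salient weights can be illustrated with a toy example. The sketch below is not the AWQ implementation: it assumes a single weight row, a fixed exponent `alpha=0.5` instead of AWQ's per-layer search, and made-up activation magnitudes. It only shows why scaling up a channel that sees large activations reduces the error that matters.

```python
def fake_quantize(row, bits=4):
    """Round-to-nearest with one absmax scale, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax or 1.0
    return [round(w / scale) * scale for w in row]


def awq_style_fake_quantize(row, act_magnitude, alpha=0.5, bits=4):
    """Scale each input channel by act_magnitude**alpha, quantize, then undo the scale.

    act_magnitude values must be positive. A real kernel would fold the
    inverse scale into the preceding operation instead of undoing it here.
    """
    s = [m ** alpha for m in act_magnitude]
    scaled = [w * sj for w, sj in zip(row, s)]
    quantized = fake_quantize(scaled, bits)
    return [q / sj for q, sj in zip(quantized, s)]


def activation_weighted_error(row, approx, act_magnitude):
    """Rough proxy for output error: each weight error times its input magnitude."""
    return sum(abs(w - a) * m for w, a, m in zip(row, approx, act_magnitude))


row = [0.5, 0.02, -0.3, 0.1]     # one row of a weight matrix (made up)
act = [1.0, 16.0, 1.0, 1.0]      # channel 1 sees much larger activations (made up)

plain = fake_quantize(row)
aware = awq_style_fake_quantize(row, act)
print("plain error:", activation_weighted_error(row, plain, act))
print("aware error:", activation_weighted_error(row, aware, act))
```

In the plain case the small weight on the high-activation channel rounds to zero, and that error is multiplied by the large activation. Scaling the channel first keeps it on the grid, which is the intuition behind the method.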

GGUF

A file format rather than a quantization algorithm. GGUF is the storage format for llama.cpp, which runs models on CPU, Mac unified memory, and consumer GPUs. The quantization schemes underneath are llama.cpp's own, most commonly the K-quants (Q4_K_M is the popular sweet spot).

Strength: the best path for on-device, CPU, or Mac inference. Q4_K_M hits roughly 92% quality retention at ~3.5x compression. [Unverified]

Weakness: designed for llama.cpp's ecosystem; not a fit for vLLM/TGI server deployments.
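
A compression ratio near 3.5x rather than a clean 4x comes from per-block metadata. The sketch below assumes a simplified layout of 32 four-bit weights sharing one 16-bit scale; it is not the actual K-quant layout, which nests blocks inside super-blocks and mixes precisions across tensors, so real Q4_K_M files land at a somewhat higher bit count. Function names are illustrative.

```python
def bits_per_weight(block_size, quant_bits, scale_bits=16):
    """Effective bits per weight when each block stores one shared scale."""
    return quant_bits + scale_bits / block_size


def quantize_blocks(weights, block_size=32, bits=4):
    """Split weights into blocks and quantize each with its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) / qmax or 1.0
        codes = [round(w / scale) for w in block]
        blocks.append((scale, codes))
    return blocks


bpw = bits_per_weight(block_size=32, quant_bits=4)
print(f"{bpw} bits per weight, {16 / bpw:.2f}x smaller than FP16")

blocks = quantize_blocks([i / 100 for i in range(-32, 32)])
print(len(blocks), "blocks, first scale:", blocks[0][0])
```

Smaller blocks track local weight ranges more closely but spend more bits on scales, which is the tradeoff block-wise formats tune.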

bitsandbytes (NF4)

The 4-bit format used by QLoRA. Designed for training (fine-tuning) more than for serving.

Strength: the de facto standard for QLoRA fine-tuning.

Weakness: slower at inference than AWQ or GPTQ; not the right choice for production serving.
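
NF4 differs from integer formats in that its 16 levels are not evenly spaced: they follow the quantiles of a normal distribution, so more levels sit near zero where most weights are. The sketch below builds a similar codebook from the standard library. It is an approximation for illustration only and does not reproduce the exact table that bitsandbytes ships; all names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(num_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def quantize_to_levels(block, levels):
    """Normalize a block by its absmax, then pick the nearest level per weight."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
        for w in block
    ]
    return codes, absmax


def dequantize_levels(codes, absmax, levels):
    """Look up each code in the codebook and restore the block's scale."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.01, -0.02, 0.15, -0.4, 0.05, 0.0]   # made-up weights
codes, absmax = quantize_to_levels(block, levels)
print(codes)
print([round(v, 3) for v in dequantize_levels(codes, absmax, levels)])
```

The codebook lookup is also a hint at why this format serves more slowly than integer formats with dedicated kernels.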

Quality at 4-bit: how much do you actually lose?

Aggregating across published benchmarks (perplexity on WikiText, MMLU, HumanEval) for 7B–70B models: [Inference]

Format	Approx. quality vs FP16	Speed and hardware notes
AWQ (4-bit)	~95–99%	Fast (Marlin kernel)
GPTQ (4-bit)	~90–95%	Fast
GGUF Q4_K_M	~92–94%	CPU/Mac native
bitsandbytes NF4	~93–96%	Slower at serving
INT8	~98–99%+	Faster than FP16
FP8 (Hopper/Blackwell)	~99%+	Native HW support

The picture: at 4-bit you generally lose 1–8% on quality benchmarks, with the exact number depending on model and task. For most application workloads (RAG, agents, classification) the loss is below the noise floor of your eval. For high-stakes reasoning, run your own eval before committing.
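
"Below the noise floor" can be checked with a quick calculation. The sketch below assumes accuracy is measured on independent examples and uses a simple normal approximation for the difference between two runs; the function names and the sample numbers are illustrative, and a paired comparison on the same examples would be more sensitive.

```python
import math


def accuracy_standard_error(accuracy, num_examples):
    """Binomial standard error of an accuracy measured on num_examples items."""
    return math.sqrt(accuracy * (1 - accuracy) / num_examples)


def drop_is_detectable(acc_fp16, acc_quant, num_examples, z=1.96):
    """True if the accuracy gap exceeds ~95% sampling noise for two independent runs."""
    se_diff = math.sqrt(
        accuracy_standard_error(acc_fp16, num_examples) ** 2
        + accuracy_standard_error(acc_quant, num_examples) ** 2
    )
    return abs(acc_fp16 - acc_quant) > z * se_diff


# A 2-point drop on a 500-example eval versus a 20,000-example eval.
print(drop_is_detectable(0.80, 0.78, 500))
print(drop_is_detectable(0.80, 0.78, 20_000))
```

With a few hundred examples, a 2-point drop is indistinguishable from sampling noise, so a small eval set cannot tell you whether quantization hurt.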

FP8: the new middle ground

NVIDIA's Hopper (H100/H200) and Blackwell (B100/B200) GPUs support FP8 natively. Where AWQ and GPTQ compress a model after training and depend on custom kernels, FP8 is a number format the tensor cores execute directly. Quality is essentially indistinguishable from FP16 on most benchmarks, and you get ~2x throughput improvement on supported hardware. [Inference]

In 2026, if you are on H100/H200/B200, FP8 is the default for inference. INT4 is reserved for the situations where you cannot afford the memory budget for FP8.

When quantization makes sense

Memory pressure: you cannot fit the model on your GPU at FP16. This is the top reason.

Throughput on the same hardware: 4-bit weights free memory for KV cache, which roughly doubles the batch size you can hold and often produces 1.5–3x throughput on memory-bandwidth-bound inference.

Edge / on-device: GGUF is the answer for laptops, phones, and edge boxes. Mac with unified memory loves it.

Cost reduction at scale: running 70B at AWQ-4bit on a single L40S is materially cheaper than FP16 across two A100s.
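
The throughput point above follows from a simple bandwidth argument: during single-sequence decoding, every generated token has to read all the weights once, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below assumes a round, hypothetical 1000 GB/s of bandwidth (not a spec for any particular card) and ignores KV cache reads and dequantization cost.

```python
def decode_tokens_per_second_bound(num_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-sequence decode speed when weight reads dominate."""
    weight_bytes = num_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


BANDWIDTH_GB_S = 1000  # hypothetical round number for illustration

fp16 = decode_tokens_per_second_bound(70e9, 16, BANDWIDTH_GB_S)
int4 = decode_tokens_per_second_bound(70e9, 4, BANDWIDTH_GB_S)
print(f"FP16 bound: {fp16:.1f} tok/s")
print(f"INT4 bound: {int4:.1f} tok/s ({int4 / fp16:.0f}x)")
```

The ideal ratio is 4x. Measured gains are smaller, in line with the 1.5–3x quoted above, because the ignored costs do not shrink with the weights.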

When not to quantize

High-stakes reasoning where you have not measured quality. Always eval first.

Already memory-rich: if FP16 fits comfortably with batch headroom, the marginal benefit shrinks.

Training: quantize only in QLoRA-style setups, where the 4-bit base weights stay frozen and only the adapter weights train. Do not quantize a model and then run ordinary full-weight training on it.

A pragmatic 2026 recipe

For most teams self-hosting an open model:

First try FP8 if you have Hopper or Blackwell — it is essentially free quality-wise and well supported in vLLM and TensorRT-LLM.

Fall back to AWQ-4bit if FP8 is not available or memory is tight. Use Marlin kernels with vLLM.

Use GGUF Q4_K_M for laptops, Macs, edge devices, anywhere llama.cpp makes sense.

Eval on your task — perplexity is not your metric; your accuracy/latency mix is.
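
The recipe above can be written down as a small decision helper. This is only a sketch of the guide's own rules of thumb; the function name, the argument names, and the architecture labels are hypothetical, and the result is a starting point to evaluate rather than a final answer.

```python
FP8_ARCHS = {"hopper", "blackwell"}  # architectures named in this guide


def pick_quantization(target, gpu_arch=None, fp8_fits=True):
    """Suggest a starting format.

    target   -- "gpu_server" for vLLM-style serving, "local" for laptops, Macs, edge
    gpu_arch -- lowercase architecture name, only used for GPU serving
    fp8_fits -- whether the model fits in memory at 8 bits per weight
    """
    if target == "local":
        return "GGUF Q4_K_M"
    if gpu_arch in FP8_ARCHS and fp8_fits:
        return "FP8"
    return "AWQ 4-bit"


print(pick_quantization("gpu_server", "hopper"))
print(pick_quantization("gpu_server", "hopper", fp8_fits=False))
print(pick_quantization("gpu_server", "ampere"))
print(pick_quantization("local"))
```

Whatever the helper returns, the last step of the recipe still applies: measure it on your own task.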

Bottom line

In 2026, quantization is not a research project; it is a serving default. AWQ and GPTQ at 4-bit run nearly all open-weight models at a single-digit-percent quality cost while doubling effective hardware capacity. FP8 on modern NVIDIA chips makes the question "should I quantize?" easier still: yes, by default, with INT4 reserved for the memory-tight cases.

Frequently Asked Questions

Will quantization hurt my model's quality?

At 4-bit AWQ or GPTQ, you typically lose 1–8% on standard benchmarks. For most application workloads the loss is below your eval noise floor. Always measure on your own task before committing.

Should I use GGUF or AWQ?

GGUF is for llama.cpp (CPU, Mac, on-device, edge). AWQ is for GPU serving with vLLM, TGI, or SGLang. Pick based on where the model runs, not on which is "better."

Is FP8 the same as INT8?

No. FP8 is an 8-bit floating-point format with hardware support on H100, H200, and Blackwell. INT8 is integer quantization. FP8 typically retains more quality than INT8 because it preserves dynamic range better.