LLM Quantization Techniques: Methods, Benchmarks, and Deployment Tips

LLM quantization is one of the most practical ways to run large language models efficiently without materially changing the model architecture. Quantization reduces the numerical precision of model weights and, in some schemes, activations and the KV cache, typically moving from FP16 or BF16 to formats like INT4, INT2, FP8, or FP4. The result is lower memory use, higher throughput, and easier deployment on GPUs, CPUs, and edge devices where memory bandwidth is the bottleneck for inference.
In modern serving stacks, quantization is no longer experimental. Weight-only INT4 is common in production, FP8 is increasingly a default for data center inference, and FP4 is maturing quickly with hardware and tooling support. This guide covers quantization for LLMs, including key techniques (GPTQ, AWQ, SmoothQuant, AQLM, VPTQ, AffineQuant, and LLM-QAT), what they optimize, and how to choose the right approach for your deployment context.

What Is LLM Quantization?
Quantization compresses an LLM by representing parameters and intermediate values with fewer bits. Instead of storing each weight in 16-bit floating point (FP16), you might store it in 4-bit integer (INT4) or 4-bit floating formats (FP4). Some approaches quantize:
Weights only: weights in 4-bit, activations remain FP16 or BF16.
Weights and activations: both are quantized (for example W4A4), which is generally harder to implement but can deliver faster inference.
KV cache: quantizing attention key-value memory to reduce long-context memory pressure.
LLM inference is frequently memory-bandwidth bound. Cutting weight precision from FP16 to 4-bit reduces weight bandwidth requirements by roughly 4x in weight-only schemes. In practice, this translates to higher tokens per second, larger batch sizes, and lower cost per request, particularly for long-running production endpoints.
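The core mechanic can be sketched with plain group-wise symmetric quantization: each group of weights shares one scale, and each weight is rounded to a 4-bit integer. This is a minimal NumPy illustration of the arithmetic, not an optimized kernel, and the group size of 32 is an illustrative choice:

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Symmetric per-group INT4 quantization of a 1-D weight array.

    Each group of `group_size` weights shares one scale; quantized
    values live in [-8, 7], the signed 4-bit integer range."""
    w = np.asarray(w, dtype=np.float32)
    q = np.empty_like(w, dtype=np.int8)
    scales = []
    for start in range(0, len(w), group_size):
        g = w[start:start + group_size]
        scale = np.abs(g).max() / 7.0 or 1.0   # guard against all-zero groups
        scales.append(scale)
        q[start:start + group_size] = np.clip(
            np.round(g / scale), -8, 7).astype(np.int8)
    return q, np.array(scales, dtype=np.float32)

def dequantize_int4(q, scales, group_size=32):
    """Reconstruct FP32 weights from INT4 codes and per-group scales."""
    w = np.empty(len(q), dtype=np.float32)
    for i, start in enumerate(range(0, len(q), group_size)):
        w[start:start + group_size] = q[start:start + group_size] * scales[i]
    return w
```

The worst-case reconstruction error per weight is half the group's scale, which is why per-group scales (rather than one scale per tensor) matter so much for quality.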
Core Benefits of LLM Quantization
Memory reduction: 4-bit quantization can reduce model weight storage by roughly 4x compared to FP16, enabling larger models on the same hardware.
Inference speedup: modern low-bit kernels and formats can deliver roughly 2x to 3x throughput improvements in favorable settings, particularly with FP4-oriented stacks such as NVFP4 on supported GPUs.
Minimal accuracy loss: state-of-the-art post-training quantization methods can preserve near-original quality, with small perplexity changes even at very low bits for some model families.
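The memory numbers above follow directly from parameter count and bit width. A quick back-of-the-envelope helper (the 128-element group size and 16-bit scales are illustrative assumptions for the quantization overhead):

```python
def weight_memory_gb(n_params, bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GB, optionally including
    the per-group scales that group-wise quantization stores."""
    total_bits = n_params * bits
    if group_size:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8 / 1e9

fp16 = weight_memory_gb(7e9, 16)                  # ~14 GB for a 7B model
int4 = weight_memory_gb(7e9, 4, group_size=128)   # ~3.6 GB including scales
```

Note that the scales shave the headline "4x" down slightly; finer group sizes improve accuracy but add more scale overhead.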
LLM Quantization Categories: PTQ, QAT, and Weight-Only
1) Post-Training Quantization (PTQ)
PTQ converts a trained LLM to low precision without full retraining. It is widely adopted because it is fast to apply, straightforward to operationalize, and typically requires only a calibration dataset or a small set of sample prompts. Most production LLM quantization methods in use today are PTQ-based.
2) Quantization-Aware Training (QAT)
QAT simulates quantization effects during training or fine-tuning. It can outperform PTQ at very low bit widths or when quantizing components that are difficult to handle post-hoc, such as the KV cache. The trade-off is additional training complexity and compute cost.
3) Weight-Only Quantization
Weight-only quantization is the dominant deployment choice today. Activations are dynamic and can be harder to quantize robustly across diverse workloads. Serving stacks commonly adopt INT4 weights with FP16 activations as a strong quality-efficiency trade-off with broad toolchain support.
Key LLM Quantization Techniques
The methods below represent the current practical landscape and the direction research and deployment are heading through 2025 and 2026.
GPTQ (3-bit to 4-bit PTQ)
GPTQ is a PTQ method that uses second-order information to minimize quantization error layer by layer. It became a standard baseline for 4-bit quantization and is widely used for weight-only compression, offering strong accuracy retention compared with simpler rounding approaches.
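The key idea can be sketched in a few lines: quantize one weight column at a time and spread the induced error over the not-yet-quantized columns using the inverse Hessian of the calibration activations. This is a heavily simplified toy version for small dense matrices; real GPTQ uses a Cholesky formulation and lazy batched updates for efficiency, so treat this as an illustration of the objective, not the production algorithm:

```python
import numpy as np

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Simplified GPTQ-style layer quantization (sketch).

    W: (out, in) weights; X: (in, samples) calibration activations.
    Columns are quantized sequentially; each column's quantization
    error is compensated on remaining columns via the inverse
    Hessian H = X X^T of the layer-output reconstruction objective."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)  # damping for stability
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax  # per-row symmetric
    scale[scale == 0] = 1.0
    Q = np.zeros_like(W)
    for j in range(n_in):
        col = W[:, j]
        Q[:, j] = np.clip(np.round(col / scale[:, 0]), -qmax - 1, qmax)
        err = (col - Q[:, j] * scale[:, 0]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate the rest
    return Q, scale
```

The compensation step is what separates GPTQ from plain round-to-nearest: it minimizes error in the layer's output space, not in the weights themselves.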
AWQ (Activation-Aware Weight Quantization, 4-bit weights)
AWQ protects salient weights based on activation distributions, typically using per-channel scaling. This approach directly addresses a common pain point in transformer quantization: a small fraction of channels can dominate quantization error if handled naively. AWQ is widely used in production because it combines strong accuracy with straightforward deployment in weight-only mode.
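The scaling trick relies on a simple identity: multiplying a weight column by s and dividing the corresponding activation channel by s leaves the layer output unchanged, while giving salient channels more of the quantization grid. The sketch below illustrates that mechanism; the alpha exponent and normalization are illustrative choices, not AWQ's actual scale search, which optimizes the scales against measured output error:

```python
import numpy as np

def awq_style_scales(X, alpha=0.5):
    """Per-input-channel scales from calibration activation magnitudes.

    Channels with large average activations get scale > 1, so their
    weights are enlarged before quantization and lose less precision."""
    act_mag = np.abs(X).mean(axis=1)   # mean |activation| per channel
    s = act_mag ** alpha
    s[s == 0] = 1.0
    return s / s.mean()               # normalize around 1

def quantize_with_scales(W, s, bits=4):
    """Quantize W after applying channel scales.

    At inference, activations are divided by `s`, so that
    (W * s) @ (X / s) reproduces W @ X exactly before rounding."""
    qmax = 2 ** (bits - 1) - 1
    Ws = W * s                         # scale input-channel columns
    row_scale = np.abs(Ws).max(axis=1, keepdims=True) / qmax
    row_scale[row_scale == 0] = 1.0
    Q = np.clip(np.round(Ws / row_scale), -qmax - 1, qmax)
    return Q, row_scale
```

In practice the inverse scales are folded into the preceding operation (for example a LayerNorm), so there is no runtime cost to the rescaling.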
SmoothQuant (Activation Outlier Handling)
SmoothQuant reduces the impact of activation outliers by smoothing the distribution across weights and activations. Outliers are a primary reason low-bit activation quantization fails on transformer models. SmoothQuant is often used as a building block in broader quantization pipelines, particularly when moving beyond weight-only schemes.
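SmoothQuant's per-channel smoothing factor has a closed form: s_j = max|X_j|^alpha / max|W_j|^(1-alpha), where alpha balances how much outlier magnitude is migrated from activations into weights. A minimal sketch:

```python
import numpy as np

def smoothquant_factors(W, X, alpha=0.5):
    """Per-channel smoothing factors: s_j = max|X_j|^a / max|W_j|^(1-a).

    Dividing activations by s and multiplying the matching weight
    columns by s preserves the layer output while flattening
    activation outliers into the (easier-to-quantize) weights."""
    act_max = np.abs(X).max(axis=1)   # per input channel, over tokens
    w_max = np.abs(W).max(axis=0)     # per input channel, over output rows
    s = act_max ** alpha / np.maximum(w_max, 1e-8) ** (1 - alpha)
    return np.maximum(s, 1e-8)
```

With alpha = 0.5 the factor is the geometric mean of the two ranges, which is the commonly cited default; layers with extreme outliers may need a larger alpha.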
AQLM (Additive Quantization, Sub-3-Bit)
AQLM uses additive quantization with multiple codebooks per weight group, targeting strong Pareto trade-offs at 2-bit to 3-bit precision. For practitioners, this signals that sub-3-bit weights are increasingly viable, though success depends on advanced quantizers and optimized inference kernels.
VPTQ (Vector PTQ, 2-bit)
VPTQ moves from scalar quantization to vector quantization with second-order optimization and residual modeling. Reported results show meaningful accuracy gains on QA tasks for LLaMA-3 and Mistral-7B at 2-bit precision, along with throughput improvements in the range of 1.6x to 1.8x in tested settings. This reflects a broader trend: vector and structured quantization methods are making 2-bit models more practical than earlier PTQ generations.
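To see why vector quantization reaches lower bit widths than scalar rounding, consider a toy codebook quantizer: weights are grouped into short vectors and each vector is replaced by the index of its nearest codebook entry. The sketch below uses plain k-means, which illustrates the representation but not VPTQ's second-order optimization or residual modeling:

```python
import numpy as np

def vq_codebook(W, dim=2, k=16, iters=20, seed=0):
    """Toy vector quantization of a weight matrix (not full VPTQ).

    Reshapes W into length-`dim` vectors and learns a k-entry
    codebook with k-means. Effective storage is log2(k)/dim bits
    per weight, e.g. log2(16)/2 = 2 bits here."""
    vecs = W.reshape(-1, dim)
    rng = np.random.default_rng(seed)
    code = vecs[rng.choice(len(vecs), k, replace=False)]
    for _ in range(iters):
        d = ((vecs[:, None, :] - code[None]) ** 2).sum(-1)
        idx = d.argmin(1)                 # nearest codebook entry
        for c in range(k):
            m = idx == c
            if m.any():
                code[c] = vecs[m].mean(0)  # update centroid
    return idx, code
```

Because a codebook entry covers a whole vector, the quantizer can match the joint distribution of weights rather than rounding each one independently, which is where the sub-3-bit headroom comes from.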
AffineQuant (W4A4 with Affine Error Minimization)
AffineQuant improves low-bit performance by applying affine transformations to reduce quantization error, supporting configurations like W4A4. It is relevant when you need quantized activations (not just weights) while still targeting competitive perplexity and zero-shot task performance.
LLM-QAT (KV Cache Quantization)
The KV cache can dominate memory usage during long-context inference. LLM-QAT applies quantization-aware training (using data generated by the model itself) to quantize weights, activations, and the KV cache, targeting sub-8-bit KV cache precision with minimal accuracy loss. This addresses a real bottleneck for enterprise chat and agentic workloads that require large context windows.
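The mechanics on the cache side are straightforward; the subtlety that motivates QAT is tolerating the precision loss during attention. A per-token symmetric 8-bit sketch (an illustrative layout; production schemes vary in how they group scales across heads and channels):

```python
import numpy as np

def quantize_kv(kv, bits=8):
    """Per-token symmetric quantization of a KV cache slab (sketch).

    kv: (tokens, heads * head_dim). One scale per cached token keeps
    dequantization cheap during attention and bounds per-entry error."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(kv).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale
```

At 8 bits this halves KV memory versus FP16; at 4 bits the savings double again, which is where training-aware methods earn their keep.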
QuIP (2-Bit Viability via Matrix Properties)
QuIP makes 2-bit quantization more viable by processing weights so that the weight and Hessian matrices become incoherent, meaning no single entry or direction dominates the quantization error. While more research-oriented than the other methods here, it reinforces the point that 2-bit LLMs are no longer purely theoretical.
Hardware Formats and the Modern Serving Stack
Quantization outcomes depend heavily on hardware support and kernel maturity. Current deployment trends include:
FP8 as a default: many data center stacks standardize on FP8 for a favorable balance of performance and accuracy.
FP4 acceleration: FP4 formats such as NVFP4 are designed for newer GPU architectures and can yield strong throughput gains when paired with PTQ pipelines like AWQ and SmoothQuant.
INT4 weight-only as the workhorse: widely supported across open-source ecosystems and practical for both LLM serving and local inference.
When choosing between INT4 and FP4, the primary constraint is often kernel and hardware availability. FP4 can be compelling on supported GPUs, while INT4 remains broadly portable across infrastructure.
Benchmarks and What They Mean for Real Deployments
Several benchmark patterns are consistently useful for planning:
Weight-only INT4 generally provides a strong balance: large memory savings with minimal behavior drift across many LLM families.
2-bit approaches can reduce execution time substantially in specific studies, but require more sophisticated quantizers and may be more sensitive to task variation.
W4A4 and KV cache quantization become important when the bottleneck is activation memory or long contexts, not just model weight size.
Throughput gains in practice depend on batch size, sequence length, kernel quality, and whether the workload is truly bandwidth-bound. For production deployments, benchmark with representative prompts, context lengths, and concurrency targets before committing to a quantization strategy.
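A representative benchmark need not be elaborate; what matters is that the prompts, context lengths, and concurrency match production. The harness below is a generic sketch where `generate` is a hypothetical callable standing in for your serving client (it is assumed to return the number of tokens produced):

```python
import time

def measure_throughput(generate, prompts, runs=3):
    """End-to-end tokens/sec for a generation callable (sketch).

    `generate(prompt)` returns a token count; swap in a call to your
    actual serving stack. Takes the best of `runs` to reduce noise."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = sum(generate(p) for p in prompts)
        elapsed = time.perf_counter() - start
        best = max(best, tokens / elapsed)
    return best
```

Run it against both the quantized and unquantized endpoint with the same prompt set; the ratio of the two numbers is the figure that should drive the deployment decision, not vendor-reported peak speedups.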
Real-World Applications of LLM Quantization
Edge and On-Device Inference
Quantization enables running larger LLMs on limited VRAM or CPU-based environments. Methods like AffineQuant and emerging sub-3-bit techniques support scenarios where 16-bit weights are not feasible, including offline assistants, private on-device summarization, and embedded copilots.
Production Inference and Cost Control
In data centers, quantization improves GPU utilization and lowers cost per token. Serving stacks commonly adopt AWQ or GPTQ for 4-bit weight-only quantization, while GPU-optimized toolchains can leverage FP8 and FP4 for additional throughput on supported hardware.
Local Developer Workflows
Formats like GGUF in llama.cpp have made quantized local inference mainstream, allowing developers to test, build workflows, and prototype agents without requiring high-end GPUs.
How to Choose an LLM Quantization Approach
Use this checklist to select a practical technique for your use case:
Start with weight-only INT4 if you want broad compatibility and low risk. Consider AWQ or GPTQ depending on your toolchain.
Use activation-aware methods (AWQ, SmoothQuant, AffineQuant) if you observe instability, outlier-driven errors, or want to pursue W4A4.
Consider KV cache quantization if long-context memory is your primary constraint. This is often the next bottleneck after weight compression.
Evaluate sub-3-bit only with rigorous benchmarks. Techniques like AQLM and VPTQ can be strong, but sensitivity varies across tasks.
Match to hardware: choose FP8 or FP4 if your GPUs and kernels support them efficiently; otherwise, INT4 is simpler and more portable.
Common Pitfalls and Best Practices
Calibration mismatch: PTQ quality depends on calibration data that reflects real prompts. Use samples representing your domains and typical context lengths.
Activation outliers: outliers can break low-bit activation quantization. Prefer methods that explicitly account for them.
Long-context regression: even when short prompts perform well, quality may degrade on long contexts due to KV cache and attention behavior. Test long-context scenarios explicitly.
Measure task quality beyond perplexity: include downstream checks such as tool-use accuracy, JSON validity, retrieval grounding behavior, and refusal correctness where relevant.
Learning Path: Skills for Applying LLM Quantization
Quantization knowledge pairs naturally with model deployment, optimization, and responsible AI practices for anyone building production LLM systems. Relevant credentials from Blockchain Council include:
Certified Generative AI Expert for end-to-end LLM understanding and applied generative AI workflows
Certified AI Engineer for production ML engineering foundations, evaluation, and deployment patterns
Certified Prompt Engineer for prompt robustness testing across quantized and non-quantized model variants
Conclusion
LLM quantization has evolved from simple rounding into a rich toolkit that supports production-grade compression with minimal quality loss. PTQ methods like GPTQ and AWQ power many real deployments today, SmoothQuant and AffineQuant address activation stability, and newer approaches like AQLM and VPTQ are pushing sub-3-bit models closer to practical use. KV cache quantization is also becoming critical for long-context applications.
For most teams, the most reliable starting point remains weight-only INT4 with a proven PTQ method, followed by targeted experimentation with activation and KV cache quantization where memory and latency constraints require it. As FP8 becomes standard and FP4 becomes more accessible, quantization will increasingly be the default approach for serving large models at scale.