# GLM 4.6 (EXL3 Quants)

- Original Model: [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6)
This repo contains:
- base quants (3, 4, 5, 6, 8 bits) for Exllamav3 (using SOTA random Hadamard transforms and Trellis quantization for high-quality reconstruction)
- layer and tensor level KL-divergence measurements for bit-allocation optimization given a target size
- theoretical research related to quantization, in particular MoE quantization
## Motivation
The goals are:
- to provide the best possible quants for what is arguably the top general model of 2025
- to serve as a reference for quantization strategies (as of 2025 knowledge)
The base model is 355B parameters, which at 4 bits per weight is roughly 177 GB (~165 GiB) of weights, leaving around 20 GiB for context, a perfect situation when you have 192 GiB of VRAM (i.e. 8x 3090/4090, 6x 5090, 4x RTX A6000, 4x RTX 6000 Ada or 2x RTX Pro 6000 Blackwell). Too bad all the 4-bit quants for my usual framework of choice, vLLM, start at 191~200 GiB of VRAM.
So while looking for a new backend that could leverage tensor parallelism, I landed on Exllamav3. Even better, it already has the proper tooling in place to fully quantize Mixture-of-Experts (MoE) models, unlike vLLM/llmcompressor, which requires extra code to ensure all experts are activated (otherwise their activations might be quantized away as unimportant if the calibration dataset is not comprehensive).
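As a quick sanity check on those numbers, here is a minimal sketch (my own, not an exllamav3 tool) of how bits-per-weight translates into weight storage for a 355B-parameter model. It tracks the measured sizes in the table below fairly closely, since it only ignores per-tensor overhead and the parts kept at higher precision.

```python
# Rough weight-storage estimate per bit-width for a 355B-parameter model,
# ignoring per-tensor overhead and the parts (embeddings, norms) kept at
# higher precision.
GIB = 1024**3

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / GIB

for bpw in (3, 4, 5, 6, 8, 16):
    print(f"{bpw:2d} bpw: ~{weights_gib(355e9, bpw):.0f} GiB")
```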
## Artifacts

### Base Quants
The base quants use the new "MCG" multiplier from https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-3395345415
- Size measured through: https://github.com/turboderp-org/exllamav3/pull/103
- Kullback-Leibler divergence (KL-div) and Top-K agreement measured through: https://github.com/turboderp-org/exllamav3/blob/v0.0.14/eval/model_diff.py
- Perplexity measured through: https://github.com/turboderp-org/exllamav3/blob/v0.0.14/eval/model_diff.py
- Caveat: both the quantization calibration and the perplexity measurement use the same dataset in EXL3, hence there is some overfitting.

The most appropriate measure of quality is KL-divergence, i.e. how well the quant reproduces the original model's output token probability distribution, before samplers are applied.
For example, the 3-bit quant shows a lower perplexity than the original FP16 (see the caveat above about calibration and perplexity sharing the same dataset).
| Quant | Size | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|
| 3bpw | 124 GiB | 0.32625636 | 0.30842110 | 4.36145115 | 0.8409 | 0.5497 | 0.3022 | 0.1527 | 0.0695 |
| 4bpw | 165 GiB | 0.15579397 | 0.15313307 | 4.64835933 | 0.8969 | 0.6892 | 0.4609 | 0.2840 | 0.1611 |
| 5bpw | 206 GiB | 0.11346048 | 0.10777174 | 4.46847223 | 0.9172 | 0.7553 | 0.5610 | 0.3868 | 0.2486 |
| 6bpw | 247 GiB | 0.08243355 | 0.07828716 | 4.46603787 | 0.9336 | 0.7970 | 0.6218 | 0.4600 | 0.3226 |
| 8bpw | 328 GiB | 0.06771311 | 0.06660905 | 4.61223994 | 0.9441 | 0.8221 | 0.6663 | 0.5155 | 0.3780 |
| FP16 | 656 GiB | | | 4.62864232 | | | | | |
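For context, here is a minimal sketch (not exllamav3's `model_diff.py`; the exact top-K definition used there is an assumption on my part) of the two metrics reported above, computed from the FP16 and quantized models' logits on the same inputs:

```python
import torch
import torch.nn.functional as F

def kl_and_topk(logits_ref: torch.Tensor, logits_q: torch.Tensor, ks=(1, 2, 3, 4, 5)):
    """logits_*: [tokens, vocab] from the FP16 and quantized model on the same inputs."""
    logp_ref = F.log_softmax(logits_ref.float(), dim=-1)
    logp_q = F.log_softmax(logits_q.float(), dim=-1)
    # KL(ref || quant); swap the arguments for the other direction in the table.
    kl = (logp_ref.exp() * (logp_ref - logp_q)).sum(dim=-1).mean()

    # One plausible reading of "Top-K agreement": both models propose the same
    # set of K most likely tokens (order-insensitive here).
    agreement = {}
    for k in ks:
        top_ref = logits_ref.topk(k, dim=-1).indices.sort(dim=-1).values
        top_q = logits_q.topk(k, dim=-1).indices.sort(dim=-1).values
        agreement[k] = (top_ref == top_q).all(dim=-1).float().mean().item()
    return kl.item(), agreement
```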
### Optimized Quants
| Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.84bpw-tuned🂱 | 158 GiB | 202752 tokens (max), k6v5 for 192GiB VRAM | 0.15942870 | 0.16406256 | 4.75754238 | 0.8881 | 0.6750 | 0.4481 | 0.2715 | 0.1520 |
| 4.16bpw-tuned🂱 | 171 GiB | 107520 tokens, k5v4 for 192GiB VRAM | 0.13325199 | 0.13198433 | 4.65080432 | 0.9061 | 0.7136 | 0.5029 | 0.3273 | 0.2002 |
- "opt🂡" for automatically optimized quants
- "tuned🂱" for hand-tuned quants
They can be downloaded with the Hugging Face CLI using the following command:
`hf download mratsim/GLM-4.6-EXL3 --revision 4.16bpw-tuned --local-dir /path/to/your/models/directory`
Unfortunately, as of November 2025, automatically optimized quants are not able to beat hand-tuned heuristics and research-based mixed-precision quantization, for, I suspect, one of two reasons (or both):
- The optimization algorithm has no backtracking: it is single-pass and does not compare the current layer's importance against that of previously processed layers.
- It does not take synergies into account. Just as LLMs have emergent properties with scale, up-quantizing certain projections together might significantly improve KL-divergence even when each individual up-quant looks like noise in isolation (see the sketch below).
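To make the first point concrete, here is a minimal sketch (my own illustration, not the actual `optimize.py` algorithm) of a single-pass, synergy-blind allocator: each candidate up-quant is scored in isolation and choices are never revisited.

```python
# Illustrative greedy bit allocator: single pass, no backtracking, and each
# tensor's gain is assumed independent of every other choice (no synergies).
from dataclasses import dataclass

@dataclass
class UpQuant:
    tensor: str        # e.g. "model.layers.10.mlp.experts.down_proj" (illustrative name)
    extra_gib: float   # additional storage if this tensor is bumped to a higher bit-width
    kl_gain: float     # measured KL-divergence reduction for this single up-quant

def greedy_allocate(candidates: list[UpQuant], budget_gib: float) -> list[str]:
    """Pick up-quants by marginal efficiency (KL gain per GiB) until the budget runs out."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=lambda c: c.kl_gain / c.extra_gib, reverse=True):
        if spent + c.extra_gib <= budget_gib:
            chosen.append(c.tensor)
            spent += c.extra_gib
    return chosen
```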
### Detailed measurements of KL-div improvements
Exllamav3 offers tools to measure per-layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-divergence improvements.
These measurements take from 2-5 hours when comparing 2 quants, around 12 hours when comparing 3 quants, and up to 24 hours of compute when comparing all quants.
Currently available are:
- 3vs4vs5 (`-l3`): json, markdown
- 4vs5vs6 (`-l3`): json, markdown
- 6vs8 (`-l3`): json
- 3vs4vs5vs6vs8 (`-l3`): json, markdown
Each JSON file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target bpw to output an optimized quant.
Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and is less likely to overfit the calibration set. Having shared experts or self_attn layers use 6 or even 8 bits provides a very large improvement to KL-divergence. Even a measurement over all available quants currently doesn't match the manual-tuning results.
## Quantization theory and heuristics for manual tuning

### Layers to quantize

Quantization should be focused on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
This is also reported in the Intel and Nvidia repos:
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950
EXL3 can only quantize linear layers.
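As an illustration (a sketch of the general principle, not EXL3 code), this is how one could list which modules of a PyTorch model are quantization candidates versus which should stay in full precision; the name-based check is an assumption meant to catch model-specific RMSNorm classes.

```python
import torch.nn as nn

def split_quantizable(model: nn.Module) -> tuple[list[str], list[str]]:
    """Return (quantize, keep) module names: nn.Linear goes into the quantization set,
    normalization and embedding layers stay in full precision."""
    quantize, keep = [], []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            quantize.append(name)          # MatMul + bias: quantization candidate
        elif isinstance(module, (nn.LayerNorm, nn.Embedding)) or "norm" in name.lower():
            keep.append(name)              # LayerNorm / RMSNorm / embeddings: keep in full precision
    return quantize, keep
```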
### Tensors to up-quantize

If there are enough bits, down projections should be prioritized.
According to [4]:

> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B. Each color represent a different projection and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model
According to [5]:

> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices W^down_ℓ of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.
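This kind of weight-outlier pattern can be checked directly on a checkpoint. Below is a minimal sketch, assuming a safetensors shard and common projection names (the path is a placeholder); it mirrors the weight-outlier analysis of [5] rather than reproducing its code.

```python
# Compare the largest absolute weight per projection type in one checkpoint shard.
from collections import defaultdict
from safetensors.torch import load_file

# Placeholder path: point it at any transformer checkpoint shard.
state = load_file("path/to/model-shard.safetensors")

max_abs = defaultdict(float)
for name, w in state.items():
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"):
        if proj in name:
            max_abs[proj] = max(max_abs[proj], w.abs().max().item())

for proj, m in sorted(max_abs.items(), key=lambda kv: -kv[1]):
    print(f"{proj:10s} max|W| = {m:.2f}")
```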
### Mixture-of-Experts quantization (MoE)

Mixture-of-Experts models require specific quantization techniques.
#### Mixed-precision quantization

Some layers have a higher impact on LLM performance. According to [2], spending more bits on attention layers results in large gains compared to spending them on FFN layers. According to [3], for 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN has a very significant impact
Hence to preserve model quality we should choose not to quantize dense FFN layers and self-attention layers.
We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16
#### Layers with high impact

According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving the same extra bits to the last k blocks.
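Putting the heuristics of this section together, a hand-tuning recipe can be expressed as a simple name-based rule. This is an illustrative sketch, not the recipe used for the published quants, and the name patterns assume GLM/DeepSeek-style MoE module naming.

```python
import re

def target_bpw(name: str, base: float = 4.0, first_blocks: int = 3) -> float:
    """Suggest a bit-width for a tensor from its name, following the heuristics above."""
    m = re.search(r"layers\.(\d+)\.", name)
    layer = int(m.group(1)) if m else None

    if "self_attn" in name:
        return 6.0                        # [3]: quantizing self-attention has a large impact
    if "shared_experts" in name:
        return 6.0                        # shared experts benefit from 6/8-bit (see above)
    if ".mlp." in name and ".experts." not in name:
        return 6.0                        # [3]: dense (non-expert) FFN is very sensitive
    bpw = base
    if "down_proj" in name:
        bpw += 1.0                        # [4][5]: down projections concentrate outliers/spikes
    if layer is not None and layer < first_blocks:
        bpw += 1.0                        # [2]: early blocks matter more than late ones
    return bpw
```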
#### Expert quantization

When quantizing MoE models, quantizing activations is tricky as only a subset of experts is activated per request.
EXL3 has the tooling in place to ensure all experts are activated during quantization, though it is unclear whether the dataset should be expanded to be diverse enough that all experts have a high likelihood of seeing the full range of values they can exhibit, to avoid clipping.
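For example (a sketch of the idea, not EXL3's calibration code; the token counts and expert counts below are made up), one can log the router's top-k choices over the calibration set and check that no expert is starved of calibration data:

```python
import torch

def expert_coverage(topk_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
    """topk_idx: [tokens, k] expert indices chosen by the router for each token.
    Returns the fraction of routed slots that went to each expert."""
    counts = torch.bincount(topk_idx.flatten(), minlength=n_experts)
    return counts.float() / counts.sum()

# Made-up example: 1M calibration tokens, 128 routed experts, top-8 routing.
routing = torch.randint(0, 128, (1_000_000, 8))
share = expert_coverage(routing, n_experts=128)
print("least-used expert share:", share.min().item())
```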
## References

1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025). Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia. https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024). Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike-Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025). Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello. https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025). Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang. https://arxiv.org/pdf/2502.06415v2