# GLM 4.6 (EXL3 Quants)

- Original Model: [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6)
This repo contains:
- base quants (3, 4, 5, 6, 8 bits) for Exllamav3 (using SOTA random Hadamard transforms and Trellis quantization for high-quality reconstruction)
- layer and tensor level KL-divergence measurements for bit-allocation optimization given a target size
- theoretical research related to quantization, in particular MoE quantization
## Motivation
The goals are:
- to provide the best possible quants for what is arguably the top general model of 2025
- to serve as a reference for quantization strategies (as of 2025 knowledge)
The base model is 355B parameters, which at 4 bits per weight is roughly 177 GB (~165 GiB) of weights, leaving around 20 GiB for context, a perfect situation when you have 192 GiB of VRAM (i.e. 8x 3090/4090, 6x 5090, 4x RTX A6000, 4x RTX 6000 Ada or 2x RTX Pro 6000 Blackwell). Too bad all the 4-bit quants for my usual framework of choice, vLLM, start at 191~200 GiB of VRAM.
So while looking for a new backend that could leverage tensor parallelism, I landed on Exllamav3. Even better, it already has the proper tooling in place to fully quantize Mixture-of-Experts (MoE) models, unlike vLLM/llmcompressor, which requires extra code to ensure all experts are activated (otherwise their activations might be quantized away as unimportant if the calibration dataset is not comprehensive).
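As a quick sanity check on those numbers, here is a minimal sketch (my own, not an exllamav3 tool) of how bits-per-weight translates into weight storage for a 355B-parameter model. It tracks the measured sizes in the table below fairly closely, since it only ignores per-tensor overhead and the parts kept at higher precision.

```python
# Rough weight-storage estimate per bit-width for a 355B-parameter model,
# ignoring per-tensor overhead and the parts (embeddings, norms) kept at
# higher precision.
GIB = 1024**3

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / GIB

for bpw in (3, 4, 5, 6, 8, 16):
    print(f"{bpw:2d} bpw: ~{weights_gib(355e9, bpw):.0f} GiB")
```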
## Artifacts

### Base Quants
The base quants use the new "MCG" multiplier from https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-3395345415
- Size measured through: https://github.com/turboderp-org/exllamav3/pull/103
- Kullback-Leibler divergence (KL-div) and Top-K agreement measured through: https://github.com/turboderp-org/exllamav3/blob/v0.0.14/eval/model_diff.py
- Perplexity measured through: https://github.com/turboderp-org/exllamav3/blob/v0.0.14/eval/model_diff.py
- Caveat: both the quantization calibration and the perplexity measurement use the same dataset in EXL3, hence there is some overfitting.

The most appropriate measure of quality is KL-divergence, i.e. how well the quant reproduces the original model's output token probability distribution, before samplers are applied.
For example, the 3-bit quant shows a lower perplexity than the original FP16 (see the caveat above about calibration and perplexity sharing the same dataset).
| Quant | Size | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|
| 3bpw | 124 GiB | 0.32625636 | 0.30842110 | 4.36145115 | 0.8409 | 0.5497 | 0.3022 | 0.1527 | 0.0695 |
| 4bpw | 165 GiB | 0.15579397 | 0.15313307 | 4.64835933 | 0.8969 | 0.6892 | 0.4609 | 0.2840 | 0.1611 |
| 5bpw | 206 GiB | 0.11346048 | 0.10777174 | 4.46847223 | 0.9172 | 0.7553 | 0.5610 | 0.3868 | 0.2486 |
| 6bpw | 247 GiB | 0.08243355 | 0.07828716 | 4.46603787 | 0.9336 | 0.7970 | 0.6218 | 0.4600 | 0.3226 |
| 8bpw | 328 GiB | 0.06771311 | 0.06660905 | 4.61223994 | 0.9441 | 0.8221 | 0.6663 | 0.5155 | 0.3780 |
| FP16 | 656 GiB | | | 4.62864232 | | | | | |
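For context, here is a minimal sketch (not exllamav3's `model_diff.py`; the exact top-K definition used there is an assumption on my part) of the two metrics reported above, computed from the FP16 and quantized models' logits on the same inputs:

```python
import torch
import torch.nn.functional as F

def kl_and_topk(logits_ref: torch.Tensor, logits_q: torch.Tensor, ks=(1, 2, 3, 4, 5)):
    """logits_*: [tokens, vocab] from the FP16 and quantized model on the same inputs."""
    logp_ref = F.log_softmax(logits_ref.float(), dim=-1)
    logp_q = F.log_softmax(logits_q.float(), dim=-1)
    # KL(ref || quant); swap the arguments for the other direction in the table.
    kl = (logp_ref.exp() * (logp_ref - logp_q)).sum(dim=-1).mean()

    # One plausible reading of "Top-K agreement": both models propose the same
    # set of K most likely tokens (order-insensitive here).
    agreement = {}
    for k in ks:
        top_ref = logits_ref.topk(k, dim=-1).indices.sort(dim=-1).values
        top_q = logits_q.topk(k, dim=-1).indices.sort(dim=-1).values
        agreement[k] = (top_ref == top_q).all(dim=-1).float().mean().item()
    return kl.item(), agreement
```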
### Optimized Quants
| Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.84bpw-tuned🂱 | 158 GiB | 202752 tokens (max), k6v5 for 192GiB VRAM | 0.15942870 | 0.16406256 | 4.75754238 | 0.8881 | 0.6750 | 0.4481 | 0.2715 | 0.1520 |
| 4.16bpw-tuned🂱 | 171 GiB | 107520 tokens, k5v4 for 192GiB VRAM | 0.13325199 | 0.13198433 | 4.65080432 | 0.9061 | 0.7136 | 0.5029 | 0.3273 | 0.2002 |
- "opt🂡" for automatically optimized quants
- "tuned🂱" for hand-tuned quants
They can be downloaded with the Hugging Face CLI using the following command:
`hf download mratsim/GLM-4.6-EXL3 --revision 4.16bpw-tuned --local-dir /path/to/your/models/directory`
Unfortunately, as of November 2025, automatically optimized quants are not able to beat hand-tuned heuristics and research-based mixed-precision quantization, for, I suspect, one of two reasons (or both):
- The optimization algorithm has no backtracking: it is single-pass and does not compare the current layer's importance against that of previously processed layers.
- It does not take synergies into account. Just as LLMs have emergent properties with scale, up-quantizing certain projections together might significantly improve KL-divergence even when each individual up-quant looks like noise in isolation (see the sketch below).
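To make the first point concrete, here is a minimal sketch (my own illustration, not the actual `optimize.py` algorithm) of a single-pass, synergy-blind allocator: each candidate up-quant is scored in isolation and choices are never revisited.

```python
# Illustrative greedy bit allocator: single pass, no backtracking, and each
# tensor's gain is assumed independent of every other choice (no synergies).
from dataclasses import dataclass

@dataclass
class UpQuant:
    tensor: str        # e.g. "model.layers.10.mlp.experts.down_proj" (illustrative name)
    extra_gib: float   # additional storage if this tensor is bumped to a higher bit-width
    kl_gain: float     # measured KL-divergence reduction for this single up-quant

def greedy_allocate(candidates: list[UpQuant], budget_gib: float) -> list[str]:
    """Pick up-quants by marginal efficiency (KL gain per GiB) until the budget runs out."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=lambda c: c.kl_gain / c.extra_gib, reverse=True):
        if spent + c.extra_gib <= budget_gib:
            chosen.append(c.tensor)
            spent += c.extra_gib
    return chosen
```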
### Detailed measurements of KL-div improvements
Exllamav3 offers tools to measure per-layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-divergence improvements.
These measurements take from 2-5 hours when comparing 2 quants, around 12 hours when comparing 3 quants, and up to 24 hours of compute when comparing all quants.
Currently available are:
- 3vs4vs5 (`-l3`): json, markdown
- 4vs5vs6 (`-l3`): json, markdown
- 6vs8 (`-l3`): json
- 3vs4vs5vs6vs8 (`-l3`): json, markdown
Each JSON file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target bpw to output an optimized quant.
Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and is less likely to overfit the calibration set. Having shared experts or self_attn layers use 6 or even 8 bits provides a very large improvement to KL-divergence. Even a measurement over all available quants currently doesn't match the manual-tuning results.
## Quantization theory and heuristics for manual tuning

### Layers to quantize

Quantization should be focused on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:
> LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.
This is also reported in the Intel and Nvidia repos:
- https://github.com/intel/neural-compressor/issues/1963#issuecomment-2274873441
- https://github.com/NVIDIA/TensorRT/issues/4084#issuecomment-2294513950
EXL3 can only quantize linear layers.
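As an illustration (a sketch of the general principle, not EXL3 code), this is how one could list which modules of a PyTorch model are quantization candidates versus which should stay in full precision; the name-based check is an assumption meant to catch model-specific RMSNorm classes.

```python
import torch.nn as nn

def split_quantizable(model: nn.Module) -> tuple[list[str], list[str]]:
    """Return (quantize, keep) module names: nn.Linear goes into the quantization set,
    normalization and embedding layers stay in full precision."""
    quantize, keep = [], []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            quantize.append(name)          # MatMul + bias: quantization candidate
        elif isinstance(module, (nn.LayerNorm, nn.Embedding)) or "norm" in name.lower():
            keep.append(name)              # LayerNorm / RMSNorm / embeddings: keep in full precision
    return quantize, keep
```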
### Tensors to up-quantize

If there are enough bits, down projections should be prioritized.
According to [4]:

> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B. Each color represent a different projection and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model
According to [5]:

> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices W^down_ℓ of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.
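This kind of weight-outlier pattern can be checked directly on a checkpoint. Below is a minimal sketch, assuming a safetensors shard and common projection names (the path is a placeholder); it mirrors the weight-outlier analysis of [5] rather than reproducing its code.

```python
# Compare the largest absolute weight per projection type in one checkpoint shard.
from collections import defaultdict
from safetensors.torch import load_file

# Placeholder path: point it at any transformer checkpoint shard.
state = load_file("path/to/model-shard.safetensors")

max_abs = defaultdict(float)
for name, w in state.items():
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"):
        if proj in name:
            max_abs[proj] = max(max_abs[proj], w.abs().max().item())

for proj, m in sorted(max_abs.items(), key=lambda kv: -kv[1]):
    print(f"{proj:10s} max|W| = {m:.2f}")
```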
### Mixture-of-Experts quantization (MoE)

Mixture-of-Experts models require specific quantization techniques.
#### Mixed-precision quantization

Some layers have a higher impact on LLM performance. According to [2], spending more bits on attention layers results in large gains compared to spending them on FFN layers. According to [3], for 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN has a very significant impact
Hence to preserve model quality we should choose not to quantize dense FFN layers and self-attention layers.
We notice that:
- the official MXFP4 weights of gpt-oss-120b from OpenAI keep self-attention in BF16
- the NVFP4 weights of DeepSeek-R1 quantized by Nvidia also keep self-attention in BF16
#### Layers with high impact

According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving the same extra bits to the last k blocks.
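Putting the heuristics of this section together, a hand-tuning recipe can be expressed as a simple name-based rule. This is an illustrative sketch, not the recipe used for the published quants, and the name patterns assume GLM/DeepSeek-style MoE module naming.

```python
import re

def target_bpw(name: str, base: float = 4.0, first_blocks: int = 3) -> float:
    """Suggest a bit-width for a tensor from its name, following the heuristics above."""
    m = re.search(r"layers\.(\d+)\.", name)
    layer = int(m.group(1)) if m else None

    if "self_attn" in name:
        return 6.0                        # [3]: quantizing self-attention has a large impact
    if "shared_experts" in name:
        return 6.0                        # shared experts benefit from 6/8-bit (see above)
    if ".mlp." in name and ".experts." not in name:
        return 6.0                        # [3]: dense (non-expert) FFN is very sensitive
    bpw = base
    if "down_proj" in name:
        bpw += 1.0                        # [4][5]: down projections concentrate outliers/spikes
    if layer is not None and layer < first_blocks:
        bpw += 1.0                        # [2]: early blocks matter more than late ones
    return bpw
```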
#### Expert quantization

When quantizing MoE models, quantizing activations is tricky as only a subset of experts is activated per request.
EXL3 has the tooling in place to ensure all experts are activated during quantization, though it is unclear whether the dataset should be expanded to be diverse enough that all experts have a high likelihood of seeing the full range of values they can exhibit, to avoid clipping.
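For example (a sketch of the idea, not EXL3's calibration code; the token counts and expert counts below are made up), one can log the router's top-k choices over the calibration set and check that no expert is starved of calibration data:

```python
import torch

def expert_coverage(topk_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
    """topk_idx: [tokens, k] expert indices chosen by the router for each token.
    Returns the fraction of routed slots that went to each expert."""
    counts = torch.bincount(topk_idx.flatten(), minlength=n_experts)
    return counts.float() / counts.sum()

# Made-up example: 1M calibration tokens, 128 routed experts, top-8 routing.
routing = torch.randint(0, 128, (1_000_000, 8))
share = expert_coverage(routing, n_experts=128)
print("least-used expert share:", share.min().item())
```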
## References

1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025). Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia. https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024). Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike-Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025). Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello. https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025). Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang. https://arxiv.org/pdf/2502.06415v2