---
license: mit
base_model:
- THUDM/GLM-4-32B-0414
datasets:
- mit-han-lab/pile-val-backup
---

# GLM-4-32B-0414 Quantized with GPTQ (4-bit weight-only, W4A16)

This repo contains GLM-4-32B-0414 quantized to 4-bit with asymmetric GPTQ, making it suitable for consumer hardware.

The model was calibrated with 2048 samples of maximum sequence length 4096 from the dataset [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).

This is my very first quantized model, so I welcome suggestions. The 2048-sample / 4096-token settings were chosen over the defaults of 512/2048 to minimize overfitting risk and maximize convergence; they also happen to fit in my GPU.

Original model:
- [THUDM/GLM-4-32B-0414](https://huggingface.co/THUDM/GLM-4-32B-0414)

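As a back-of-the-envelope check on the consumer-hardware claim, here is a rough weight-memory estimate for a 32B-parameter model (illustrative only; the real checkpoint size also depends on embedding and `lm_head` precision, and serving needs extra VRAM for the KV cache):

```python
# Rough weight-memory estimate for 32B parameters (illustrative, not measured).
# W4A16 group quantization (group_size=128) stores, per group of 128 weights:
# 128 x 4-bit ints plus a 16-bit scale and a 16-bit zero-point (asymmetric).
params = 32e9

fp16_gib = params * 2 / 2**30                 # 16-bit baseline: 2 bytes/weight
bits_per_weight = 4 + (16 + 16) / 128         # 4-bit payload + scale/zero overhead
w4a16_gib = params * bits_per_weight / 8 / 2**30

print(f"FP16 : {fp16_gib:.0f} GiB")   # ~60 GiB, too large for one consumer GPU
print(f"W4A16: {w4a16_gib:.0f} GiB")  # ~16 GiB of weights, leaving room for KV cache
```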
## 📥 Usage & Running Instructions

The model was tested with vLLM. Here is a launch script suitable for 32 GB VRAM GPUs:

```bash
export MODEL="mratsim/GLM-4-32B-0414.w4a16-gptq"
vllm serve "${MODEL}" \
  --served-model-name glm-4-32b \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-model-len 130000 \
  --max-num-seqs 256 \
  --generation-config "${MODEL}" \
  --enable-auto-tool-choice --tool-call-parser pythonic \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
```

## 🔬 Quantization method

The [llmcompressor](https://github.com/vllm-project/llm-compressor) library was used with the following recipe for asymmetric GPTQ:

```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      dampening_frac: 0.005
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 4, type: int, symmetric: false, group_size: 128, strategy: group,
            dynamic: false, observer: minmax}
      ignore: [lm_head]
```

The model was then calibrated on 2048 samples of sequence length 4096 from [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).
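To illustrate what the `symmetric: false`, `group_size: 128`, and `observer: minmax` settings mean, here is a toy re-implementation of asymmetric 4-bit group quantization in plain Python (a pedagogical sketch, not llmcompressor's actual code):

```python
import random

# Toy asymmetric 4-bit quantization of one weight group, as in the recipe above:
# group_size=128, minmax observer, symmetric=false (hypothetical example data).
random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(128)]  # one group of 128 weights

lo, hi = min(group), max(group)  # minmax observer: track the observed range
qmin, qmax = 0, 15               # 4-bit range
scale = (hi - lo) / (qmax - qmin)
zero_point = round(-lo / scale)  # asymmetric: a non-zero offset maps lo -> qmin

quant = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in group]
dequant = [(q - zero_point) * scale for q in quant]

# Reconstruction error is bounded by one quantization step per weight.
max_err = max(abs(w - d) for w, d in zip(group, dequant))
print(f"scale={scale:.4f} zero_point={zero_point} max_err={max_err:.4f}")
```

Only the 4-bit codes plus one `scale` and one `zero_point` per group are stored, which is where the ~4.25 effective bits per weight come from.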