---
license: mit
base_model:
- THUDM/GLM-4-32B-0414
datasets:
- mit-han-lab/pile-val-backup
---

# GLM-4-32B-0414 Quantized with GPTQ (4-bit weight-only, W4A16)

This repo contains GLM-4-32B-0414 quantized to 4-bit with asymmetric GPTQ, making it suitable for consumer hardware.

The model was calibrated with 2048 samples of maximum sequence length 4096 from the dataset [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).

This is my very first quantized model, so I welcome suggestions. The 2048-sample / 4096-token settings were chosen over the defaults of 512/2048 to minimize overfitting risk and maximize convergence; they also happen to fit in my GPU.

Original model:
- [THUDM/GLM-4-32B-0414](https://huggingface.co/THUDM/GLM-4-32B-0414)

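As a back-of-the-envelope check on the consumer-hardware claim, here is a rough weight-memory estimate for a 32B-parameter model (illustrative only; the real checkpoint size also depends on embedding and `lm_head` precision, and serving needs extra VRAM for the KV cache):

```python
# Rough weight-memory estimate for 32B parameters (illustrative, not measured).
# W4A16 group quantization (group_size=128) stores, per group of 128 weights:
# 128 x 4-bit ints plus a 16-bit scale and a 16-bit zero-point (asymmetric).
params = 32e9

fp16_gib = params * 2 / 2**30                 # 16-bit baseline: 2 bytes/weight
bits_per_weight = 4 + (16 + 16) / 128         # 4-bit payload + scale/zero overhead
w4a16_gib = params * bits_per_weight / 8 / 2**30

print(f"FP16 : {fp16_gib:.0f} GiB")   # ~60 GiB, too large for one consumer GPU
print(f"W4A16: {w4a16_gib:.0f} GiB")  # ~16 GiB of weights, leaving room for KV cache
```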
## 📥 Usage & Running Instructions

The model was tested with vLLM. Here is a launch script suitable for 32 GB VRAM GPUs:

```bash
export MODEL="mratsim/GLM-4-32B-0414.w4a16-gptq"
vllm serve "${MODEL}" \
  --served-model-name glm-4-32b \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-model-len 130000 \
  --max-num-seqs 256 \
  --generation-config "${MODEL}" \
  --enable-auto-tool-choice --tool-call-parser pythonic \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
```

## 🔬 Quantization method

The [llmcompressor](https://github.com/vllm-project/llm-compressor) library was used with the following recipe for asymmetric GPTQ:

```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      dampening_frac: 0.005
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 4, type: int, symmetric: false, group_size: 128, strategy: group,
            dynamic: false, observer: minmax}
      ignore: [lm_head]
```

The model was then calibrated on 2048 samples of sequence length 4096 from [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).
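To illustrate what the `symmetric: false`, `group_size: 128`, and `observer: minmax` settings mean, here is a toy re-implementation of asymmetric 4-bit group quantization in plain Python (a pedagogical sketch, not llmcompressor's actual code):

```python
import random

# Toy asymmetric 4-bit quantization of one weight group, as in the recipe above:
# group_size=128, minmax observer, symmetric=false (hypothetical example data).
random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(128)]  # one group of 128 weights

lo, hi = min(group), max(group)  # minmax observer: track the observed range
qmin, qmax = 0, 15               # 4-bit range
scale = (hi - lo) / (qmax - qmin)
zero_point = round(-lo / scale)  # asymmetric: a non-zero offset maps lo -> qmin

quant = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in group]
dequant = [(q - zero_point) * scale for q in quant]

# Reconstruction error is bounded by one quantization step per weight.
max_err = max(abs(w - d) for w, d in zip(group, dequant))
print(f"scale={scale:.4f} zero_point={zero_point} max_err={max_err:.4f}")
```

Only the 4-bit codes plus one `scale` and one `zero_point` per group are stored, which is where the ~4.25 effective bits per weight come from.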