mratsim committed · Commit c873b3f · verified · 1 Parent(s): 85033a8

Write Model Card

Files changed (1): README.md (+56 -3)
README.md CHANGED
@@ -1,3 +1,56 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model:
+ - THUDM/GLM-4-32B-0414
+ datasets:
+ - mit-han-lab/pile-val-backup
+ ---
+
+ # GLM-4-32B-0414 Quantized with GPTQ (4-bit weight-only, W4A16)
+
+ This repo contains GLM-4-32B-0414 quantized to 4-bit with asymmetric GPTQ to make it suitable for consumer hardware.
+
+ The model was calibrated with 2048 samples of max sequence length 4096 from the dataset [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).
+
+ This is my very first quantized model; suggestions are welcome. Calibration with 2048 samples at sequence length 4096 was chosen over the default of 512/2048 to reduce overfitting risk and improve convergence.
+ These settings also happen to fit in my GPU.
+
+ Original Model:
+ - [THUDM/GLM-4-32B-0414](https://huggingface.co/THUDM/GLM-4-32B-0414)
+
+ ## 📥 Usage & Running Instructions
+
+ The model was tested with vLLM; here is a launch script suitable for 32 GB VRAM GPUs.
+
+ ```bash
+ export MODEL="mratsim/GLM-4-32B-0414.w4a16-gptq"
+ vllm serve "${MODEL}" \
+   --served-model-name glm-4-32b \
+   --gpu-memory-utilization 0.90 \
+   --enable-prefix-caching \
+   --enable-chunked-prefill \
+   --max-model-len 130000 \
+   --max-num-seqs 256 \
+   --generation-config "${MODEL}" \
+   --enable-auto-tool-choice --tool-call-parser pythonic \
+   --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
+ ```
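+
+ Once the server is running, it exposes an OpenAI-compatible API. A minimal query sketch (the default `localhost:8000` endpoint and the prompt are assumptions, not part of this repo):
+
+ ```python
+ # Query the vLLM server started above via its OpenAI-compatible API.
+ # Assumes vLLM's default address http://localhost:8000/v1 (an assumption).
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="glm-4-32b",  # must match --served-model-name above
+     messages=[{"role": "user", "content": "Explain GPTQ quantization in two sentences."}],
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```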
+
+ ## 🔬 Quantization method
+
+ The llmcompressor library was used with the following recipe for asymmetric GPTQ:
+
+ ```yaml
+ default_stage:
+   default_modifiers:
+     GPTQModifier:
+       dampening_frac: 0.005
+       config_groups:
+         group_0:
+           targets: [Linear]
+           weights: {num_bits: 4, type: int, symmetric: false, group_size: 128,
+             strategy: group, dynamic: false, observer: minmax}
+       ignore: [lm_head]
+ ```
+
+ The recipe was calibrated on 2048 samples of sequence length 4096 from [`mit-han-lab/pile-val-backup`](https://huggingface.co/datasets/mit-han-lab/pile-val-backup).
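+
+ For reference, a minimal sketch of how this recipe can be applied with llmcompressor's `oneshot` entrypoint (the shuffle seed and output directory are assumptions, not the exact script used for this repo):
+
+ ```python
+ # Sketch: reproduce the W4A16 asymmetric GPTQ quantization with llmcompressor.
+ from datasets import load_dataset
+ from llmcompressor.modifiers.quantization import GPTQModifier
+ from llmcompressor.transformers import oneshot
+
+ NUM_SAMPLES = 2048   # calibration samples, as in the model card
+ MAX_SEQ_LEN = 4096   # max calibration sequence length
+
+ # Calibration data from pile-val-backup (seed 42 is an assumption).
+ ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
+ ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))
+
+ # Mirror the YAML recipe: 4-bit asymmetric int weights, group size 128.
+ recipe = GPTQModifier(
+     dampening_frac=0.005,
+     config_groups={
+         "group_0": {
+             "targets": ["Linear"],
+             "weights": {
+                 "num_bits": 4,
+                 "type": "int",
+                 "symmetric": False,
+                 "group_size": 128,
+                 "strategy": "group",
+                 "dynamic": False,
+                 "observer": "minmax",
+             },
+         }
+     },
+     ignore=["lm_head"],
+ )
+
+ oneshot(
+     model="THUDM/GLM-4-32B-0414",
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=MAX_SEQ_LEN,
+     num_calibration_samples=NUM_SAMPLES,
+     output_dir="GLM-4-32B-0414.w4a16-gptq",  # hypothetical output path
+ )
+ ```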