---
tags:
- vllm
- vision
- w4a16
license: gemma
base_model: google/gemma-3-27b-it
library_name: transformers
---

# gemma-3-27b-it-quantized.w4a16

## Model Overview
- **Model Architecture:** google/gemma-3-27b-it
  - **Input:** Vision-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 6/4/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

Quantized version of [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it).

### Model Optimizations

This model was obtained by quantizing the weights of [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it) to the INT4 data type, ready for inference with vLLM >= 0.8.0.
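
To make the scheme concrete, the snippet below is a minimal sketch of what W4A16 means: weights are reduced to group-wise symmetric INT4 values, while activations stay in 16-bit floating point. The group size of 128 is an assumption for illustration only; the authoritative settings are those of the GPTQModifier recipe in the Creation section below.

```python
import torch

GROUP_SIZE = 128  # illustrative assumption; see the Creation recipe for the actual scheme


def fake_quantize_w4(weight: torch.Tensor, group_size: int = GROUP_SIZE) -> torch.Tensor:
    """Quantize a 2-D weight to symmetric INT4 per group of input channels, then dequantize."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0  # symmetric INT4 range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)                    # the 4-bit codes stored on disk
    return (q * scale).reshape(out_features, in_features)             # dequantized weights used at inference


w = torch.randn(4096, 4096)          # a Linear layer weight
x = torch.randn(1, 4096)             # activations are not quantized ("A16")
y = x @ fake_quantize_w4(w).T        # the matmul runs against the (de)quantized weights
```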

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# Prepare the model.
llm = LLM(
    model="nm-testing/gemma-3-27b-it-quantized.w4a16",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# Prepare the inputs using the Gemma 3 chat format.
question = "What is the content of this image?"
inputs = {
    "prompt": f"<bos><start_of_turn>user\n<start_of_image>{question}<end_of_turn>\n<start_of_turn>model\n",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# Generate a response.
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
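
As a minimal sketch of OpenAI-compatible usage, the example below assumes the server was started with `vllm serve nm-testing/gemma-3-27b-it-quantized.w4a16 --max-model-len 4096`; the port, API key, and image URL are illustrative placeholders.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (placeholder address and key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nm-testing/gemma-3-27b-it-quantized.w4a16",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the content of this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
            ],
        }
    ],
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].message.content)
```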

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below:

<details>
<summary>Model Creation Code</summary>

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Load model.
model_id = "google/gemma-3-27b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = "neuralmagic/calibration"
DATASET_SPLIT = {"LLM": "train[:1024]"}
NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 2048

# Load and shuffle the calibration dataset.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42)

dampening_frac = 0.07

def data_collator(batch):
    """Collate a single calibration sample into a batch of size 1."""
    assert len(batch) == 1, "Only batch size of 1 is supported for calibration"
    item = batch[0]
    collated = {}

    for key, value in item.items():
        if isinstance(value, torch.Tensor):
            # Tensors get a leading batch dimension.
            collated[key] = value.unsqueeze(0)
        elif isinstance(value, list) and isinstance(value[0][0], (int, float)):
            # Tokenized inputs such as input_ids and attention_mask, or float sequences.
            collated[key] = torch.tensor(value)
        elif isinstance(value, list) and isinstance(value[0][0], torch.Tensor):
            # Batched image data (e.g., pixel_values as [C, H, W]).
            collated[key] = torch.stack(value)  # -> [1, C, H, W]
        else:
            print(f"[WARN] Unrecognized type in collator for key={key}, type={type(value)}")

    return collated

# Recipe: quantize all Linear weights to INT4, excluding the language-model head,
# the embeddings, and the vision components.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["re:.*lm_head.*", "re:.*embed_tokens.*", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_update=True,
        dampening_frac=dampening_frac,
    )
]

SAVE_DIR = f"{model_id.split('/')[1]}-quantized.w4a16"

# Perform oneshot quantization.
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    output_dir=SAVE_DIR,
)
```
</details>
161
+ ## Evaluation
162
+
163
+ The model was evaluated using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for OpenLLM v1 text benchmark. The evaluations were conducted using the following commands:
164
+
165
+ <details>
166
+ <summary>Evaluation Commands</summary>
167
+
168
+ ### OpenLLM v1
169
+ ```
170
+ lm_eval \
171
+ --model vllm \
172
+ --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True,enforce_eager=True \
173
+ --tasks openllm \
174
+ --batch_size auto
175
+ ```
176
+ </details>
177
+
178
+
179
+ ### Accuracy
180
+
181
+ <table>
182
+ <thead>
183
+ <tr>
184
+ <th>Category</th>
185
+ <th>Metric</th>
186
+ <th>google/gemma-3-27b-it</th>
187
+ <th>nm-testing/gemma-3-27b-it-quantized.w8a8</th>
188
+ <th>Recovery (%)</th>
189
+ </tr>
190
+ </thead>
191
+ <tbody>
192
+ <tr>
193
+ <td rowspan="7"><b>OpenLLM V1</b></td>
194
+ <td>ARC Challenge</td>
195
+ <td>72.53%</td>
196
+ <td>72.35%</td>
197
+ <td>99.76%</td>
198
+ </tr>
199
+ <tr>
200
+ <td>GSM8K</td>
201
+ <td>92.12%</td>
202
+ <td>91.66%</td>
203
+ <td>99.51%</td>
204
+ </tr>
205
+ <tr>
206
+ <td>Hellaswag</td>
207
+ <td>85.78%</td>
208
+ <td>84.97%</td>
209
+ <td>99.06%</td>
210
+ </tr>
211
+ <tr>
212
+ <td>MMLU</td>
213
+ <td>77.53%</td>
214
+ <td>76.77%</td>
215
+ <td>99.02%</td>
216
+ </tr>
217
+ <tr>
218
+ <td>Truthfulqa (mc2)</td>
219
+ <td>62.20%</td>
220
+ <td>62.57%</td>
221
+ <td>100.59%</td>
222
+ </tr>
223
+ <tr>
224
+ <td>Winogrande</td>
225
+ <td>79.40%</td>
226
+ <td>79.79%%</td>
227
+ <td>100.50%</td>
228
+ </tr>
229
+ <tr>
230
+ <td><b>Average Score</b></td>
231
+ <td><b>78.26%</b></td>
232
+ <td><b>78.02%</b></td>
233
+ <td><b>99.70%</b></td>
234
+ </tr>
235
+ </tbody>
236
+ </table>
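
The Recovery column is the quantized score divided by the baseline score, expressed as a percentage. The sketch below reproduces that calculation from the rounded scores in the table; tiny deviations from the table's recovery values are rounding effects.

```python
# Recompute the Recovery column from the rounded scores in the table above.
baseline = {"ARC Challenge": 72.53, "GSM8K": 92.12, "Hellaswag": 85.78,
            "MMLU": 77.53, "TruthfulQA (mc2)": 62.20, "Winogrande": 79.40}
quantized = {"ARC Challenge": 72.35, "GSM8K": 91.66, "Hellaswag": 84.97,
             "MMLU": 76.77, "TruthfulQA (mc2)": 62.57, "Winogrande": 79.79}

for task, base in baseline.items():
    print(f"{task}: {100 * quantized[task] / base:.2f}% recovery")

avg_base = sum(baseline.values()) / len(baseline)
avg_quant = sum(quantized.values()) / len(quantized)
print(f"Average: {avg_quant:.2f} vs {avg_base:.2f} -> {100 * avg_quant / avg_base:.2f}% recovery")
```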