Problem with model quantization

#12
by bialykostek - opened

Hi,
I've fine-tuned the model and, before deployment, I wanted to quantize it. I tried llm-compressor since AutoAWQ is deprecated, but calibration fails with an error in the get_vllm_embedding function. To reproduce the problem, just run this code (adapted from the llm-compressor examples):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
MODEL_ID = "openbmb/MiniCPM-V-4_5"            # Qwen/Qwen3-8B works well 
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure algorithms. 
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Apply algorithms and save to output_dir
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True
)

Traceback:

Traceback (most recent call last):
  File "/home/bialykostek/4oc/MiniCPM-V/finetune/LLaMA-Factory/models/comp3.py", line 63, in <module>
    oneshot(
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 319, in oneshot
    one_shot()
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 149, in __call__
    self.apply_recipe_modifiers(
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 192, in apply_recipe_modifiers
    pipeline(
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 72, in __call__
    subgraphs = trace_subgraphs(model, sample_input, sequential_targets, ignore)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 125, in trace_subgraphs
    tracer.trace(
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/transformers/utils/fx.py", line 1315, in trace
    self.graph = super().trace(root, concrete_args=concrete_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/bialykostek/4oc/.venv/lib/python3.12/site-packages/torch/fx/_symbolic_trace.py", line 837, in trace
    (self.create_arg(fn(*args)),),
                     ^^^^^^^^^
  File "MiniCPMV_8730936129375_autowrapped", line -1, in forward
  File "/home/bialykostek/.cache/huggingface/modules/transformers_modules/openbmb/MiniCPM-V-4_5/0fe9c69d46b5539b14521791f38e96e9ed007ff9/modeling_minicpmv.py", line 79, in get_vllm_embedding
    if 'vision_hidden_states' not in data:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'type' is not iterable

I've tried different configurations, but in the end I always hit the same error. Is the problem in the model wrapper? Can you help me with that? Alternatively, can you tell me which software supports quantization of your model?

Thanks!

OpenBMB org

@bialykostek
https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/awq/minicpm-v4_5_awq_quantize.md#method-2-use-the-pre-quantized-model
I apologize for not replying right away; I may have missed this while going through my email notifications.
As you mentioned, official development of AutoAWQ has been abandoned.
However, we maintain our own AutoAWQ repository and have also developed model quantization support along with tutorials.
You can refer to the documentation above for usage. I hope it helps.
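For a quick orientation, the manual quantization flow in that tutorial boils down to the usual AutoAWQ calls (a minimal sketch, assuming the maintained fork keeps the upstream AutoAWQ API; the paths and calibration texts below are placeholders):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/your/finetuned/MiniCPM-V-4_5"   # placeholder
quant_path = "path/to/output/minicpm-v-4_5-awq"       # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# calib_data is just a list of plain-text calibration prompts
calib_data = ["Replace this with real calibration text.", "A few hundred samples is typical."]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)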

Hi!
Today I spent a few hours trying to make it work. Unfortunately, the script from the cookbook (method 3, manual quantization - I'm trying to quantize a fine-tuned model) did not work: it failed with an error in the same get_vllm_embedding function. I finally managed to fix it by pasting a piece of code copied from MiniCPM-V-4 into the modeling_minicpmv.py file. At the beginning of the forward() function I added this input parser:

def forward(self, data, **kwargs):
    ### pasted from MiniCPM-V-4 
    if isinstance(data, torch.Tensor):
        attention_mask = torch.ones_like(data, dtype=torch.bool)
        kwargs = {'attention_mask': attention_mask}
        return self.llm(
            input_ids=data,
            **kwargs
        )

    if data is None:
        data = {
            "input_ids": kwargs.pop("input_ids", None),
            "pixel_values": kwargs.pop("pixel_values", None),
            "image_bound": kwargs.pop("image_bound", None),
            "tgt_sizes": kwargs.pop("tgt_sizes", None),
            "position_ids": kwargs.pop("position_ids", None),
        }
    else:
        kwargs.pop("input_ids", None)
        kwargs.pop("pixel_values", None)
        kwargs.pop("image_bound", None)
        kwargs.pop("tgt_sizes", None)
        kwargs.pop("position_ids", None)
    kwargs.pop("inputs_embeds", None)
    ### rest without changes
    vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)

Not sure whether this code was removed for a reason, but now it works perfectly. It might also fix the llm-compressor error I mentioned before - still to be tested.
Hope it helps someone :)
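A quick way to sanity-check the patch before a full quantization run: the AWQ calibration step calls the wrapper positionally with a bare input_ids tensor, which is exactly what the new isinstance branch handles, so a plain forward call should now return logits instead of crashing. A minimal sketch, assuming the patched checkpoint is loaded with trust_remote_code (the path is a placeholder):

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "path/to/patched/MiniCPM-V-4_5"  # placeholder for the patched checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)

# A positional tensor now goes straight to self.llm with an all-ones attention mask
input_ids = tokenizer("hello", return_tensors="pt").input_ids
out = model(input_ids)
print(out.logits.shape)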

I encountered a similar error when quantizing openbmb/MiniCPM-V-4_5.

Repository: https://github.com/tc-mb/AutoAWQ.git

Script:
(.venv-awq) root@dsw-624670-588555b759-rl5vr:/mnt/workspace/AutoAWQ# cat /mnt/data/export_awq.py
import os
from datasets import load_dataset, load_from_disk
from awq import AutoAWQForCausalLM
import torch
from transformers import AutoTokenizer
import shutil

# Set the path to the original model (can be a local path or model ID)
model_path = '/mnt/workspace/checkpoints/MiniCPM-V-4_5'

# Path to save the quantized model
quant_path = '/mnt/workspace/checkpoints/minicpmv4_5_awq'

# Quantization configuration: 4-bit weights, group size 128, GEMM backend
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } # "w_bit": 4 or 8

# Load the original model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Copy files that exist in model_path but not in quant_path (excluding weight files)
def copy_files_not_in_B(A_path, B_path):
    """
    Copies files from directory A to directory B if they exist in A but not in B.

    :param A_path: Path to the source directory (A).
    :param B_path: Path to the destination directory (B).
    """
    # Ensure source directory exists
    if not os.path.exists(A_path):
        raise FileNotFoundError(f"The directory {A_path} does not exist.")
    if not os.path.exists(B_path):
        os.makedirs(B_path)

    # List all files in directory A except weight files (e.g., .bin or safetensors)
    files_in_A = os.listdir(A_path)
    files_in_A = set([file for file in files_in_A if not (".bin" in file or "safetensors" in file)])
    # List all files in directory B
    files_in_B = set(os.listdir(B_path))

    # Determine which files need to be copied
    files_to_copy = files_in_A - files_in_B

    # Copy each missing file from A to B
    for file in files_to_copy:
        src_file = os.path.join(A_path, file)
        dst_file = os.path.join(B_path, file)
        if os.path.isfile(src_file):
            shutil.copy2(src_file, dst_file)

# Define data loading methods

# Load the Alpaca dataset
def load_alpaca():
    data = load_dataset("tatsu-lab/alpaca", split="train")

    # Convert each example into a chat-style prompt
    def concatenate_data(x):
        if x['input'] and x['instruction']:
            msgs = [
                {"role": "system", "content": x['instruction']},
                {"role": "user", "content": x['input']},
                {"role": "assistant", "content": x['output']},
            ]
        elif x['input']:
            msgs = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": x['input']},
                {"role": "assistant", "content": x['output']}
            ]
        else:
            msgs = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": x['instruction']},
                {"role": "assistant", "content": x['output']}
            ]

        data = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
        return {"text": data}

    concatenated = data.map(concatenate_data)
    return [text for text in concatenated["text"]][:1024]

# Load the Wikitext dataset
def load_wikitext():
    data = load_dataset('wikitext', 'wikitext-2-raw-v1', split="train")
    return [text for text in data["text"] if text.strip() != '' and len(text.split(' ')) > 20]

# Load calibration data
calib_data = load_alpaca()

# Quantize
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)

shutil.rmtree(quant_path, ignore_errors=True)

# Save the quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

copy_files_not_in_B(model_path, quant_path)
print(f'Model is quantized and saved at "{quant_path}"')

(.venv-awq) root@dsw-624670-588555b759-rl5vr:/mnt/workspace/AutoAWQ# uv pip list
Using Python 3.10.18 environment at: /mnt/workspace/.venv-awq
Package Version Editable project location


accelerate 1.10.1
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiosignal 1.4.0
async-timeout 5.0.1
attrs 25.3.0
autoawq 0.2.9 /mnt/workspace/AutoAWQ
certifi 2025.8.3
charset-normalizer 3.4.3
datasets 4.1.1
dill 0.4.0
filelock 3.19.1
frozenlist 1.7.0
fsspec 2025.9.0
hf-xet 1.1.10
huggingface-hub 0.35.3
idna 3.10
jinja2 3.1.6
markupsafe 3.0.3
mpmath 1.3.0
multidict 6.6.4
multiprocess 0.70.16
networkx 3.4.2
numpy 2.2.6
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-cusparselt-cu12 0.6.2
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
packaging 25.0
pandas 2.3.3
pillow 11.3.0
pip 25.2
propcache 0.4.0
psutil 7.1.0
pyarrow 21.0.0
python-dateutil 2.9.0.post0
pytz 2025.2
pyyaml 6.0.3
regex 2025.9.18
requests 2.32.5
safetensors 0.6.2
setuptools 80.9.0
six 1.17.0
sympy 1.13.1
tokenizers 0.21.4
torch 2.6.0
torchvision 0.21.0
tqdm 4.67.1
transformers 4.51.3
triton 3.2.0
typing-extensions 4.15.0
tzdata 2025.2
urllib3 2.5.0
wheel 0.45.1
xxhash 3.6.0
yarl 1.20.1
zstandard 0.25.0
(.venv-awq) root@dsw-624670-588555b759-rl5vr:/mnt/workspace/AutoAWQ# python /mnt/data/export_awq.py
/mnt/workspace/AutoAWQ/awq/__init__.py:21: DeprecationWarning:
I have left this message as the final dev message to help you transition.

Important Notice:

  • AutoAWQ is officially deprecated and will no longer be maintained.
  • The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
  • If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:

For further inquiries, feel free to reach out:

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 22.13it/s]
Traceback (most recent call last):
  File "/mnt/data/export_awq.py", line 93, in <module>
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
  File "/mnt/workspace/.venv-awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/workspace/AutoAWQ/awq/models/base.py", line 225, in quantize
    self.quantizer = quantizer_cls(
  File "/mnt/workspace/AutoAWQ/awq/quantize/quantizer.py", line 70, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant(
  File "/mnt/workspace/AutoAWQ/awq/quantize/quantizer.py", line 607, in init_quant
    self.model(samples.to(next(self.model.parameters()).device))
  File "/mnt/workspace/.venv-awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/workspace/.venv-awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/MiniCPM-V-4_5/modeling_minicpmv.py", line 206, in forward
    vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
  File "/root/.cache/huggingface/modules/transformers_modules/MiniCPM-V-4_5/modeling_minicpmv.py", line 79, in get_vllm_embedding
    if 'vision_hidden_states' not in data:
  File "/mnt/workspace/.venv-awq/lib/python3.10/site-packages/torch/_tensor.py", line 1225, in __contains__
    raise RuntimeError(
RuntimeError: Tensor.__contains__ only supports Tensor or scalar, but you passed in a <class 'str'>.
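It looks like the same root cause: the calibration batch reaches forward() as a bare tensor, so the dict-style membership check in get_vllm_embedding cannot handle it. A minimal illustration of just the failing check, independent of the model:

import torch

input_ids = torch.ones(1, 8, dtype=torch.long)
try:
    "vision_hidden_states" in input_ids  # what the check at modeling_minicpmv.py line 79 effectively does
except RuntimeError as err:
    print(err)  # Tensor.__contains__ only supports Tensor or scalar ...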

(quoting bialykostek's earlier reply with the forward() patch)

it works!

OpenBMB org

https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/awq/minicpm-v4_5_awq_quantize.md
You can refer to the documentation we provided. If you still have questions, I will reply as soon as possible after the holiday.

OpenBMB org

@bialykostek OK, I understand the whole process. thx ^_^
