Failing to quantize using your method

#4 opened by redd2dead

Hello,
Because I work in an air-gapped environment, I am unable to simply download the model, so I tried to follow your instructions under Creation to reproduce it.
Unfortunately, this has been rather unsuccessful, with some bizarre issues:

It seems that most model layers do get quantized, except for the tensors named language_model.model.layers.<layer_num>.feed_forward.experts.gate_up_proj.
These are apparently bare Parameters rather than Modules of their own, so they can't be targeted and quantized like any other layer (according to my research, but I could be wrong).
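As a quick way to see this (a minimal inspection sketch against the loaded model, not something taken from the model card), you can check how those tensors are registered:

# Sketch: a normal Linear weight is registered as "<module>.weight", where the
# module is an nn.Linear that targets="Linear" can match. The MoE expert
# tensors are instead registered directly as parameters on the `experts`
# module, so there is no nn.Linear for the recipe to quantize.
module_names = {n for n, _ in model.named_modules()}
for name, param in model.named_parameters():
    if name.endswith("feed_forward.experts.gate_up_proj"):
        print(name, tuple(param.shape), param.dtype)              # stacked per-expert tensor
        print("registered as a module?", name in module_names)    # False
        owner = model.get_submodule(name.rsplit(".", 1)[0])
        print("owning module type:", type(owner).__name__)        # the experts module, not nn.Linear
        break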

Moreover, I initially attempted quantization following the vLLM docs, which state the following:

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
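
For context, the docs snippet above assumes MODEL_ID, model and tokenizer are already defined; a minimal loading preamble along these lines (model ID taken from the comment in the snippet) would be:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed preamble (not part of the docs excerpt): the snippet expects
# MODEL_ID, model and tokenizer to already exist.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)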

However, the Creation steps were a bit different:

    recipe = QuantizationModifier(
        targets="Linear",
        config_groups={"group_0": quant_scheme},
        ignore=[
            're:.*lm_head',
            're:.*self_attn',
            're:.*router',
            're:.*vision_model',
            're:.*multi_modal_projector',
        ]
    )
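
Purely for illustration (the actual quant_scheme isn't reproduced here, and the real definition from the Creation section may differ), an FP8-dynamic scheme built with compressed-tensors typically looks something like:

from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

# Illustrative guess only: an FP8 W8A8 dynamic scheme of the kind usually
# passed via config_groups; the model card's actual quant_scheme may differ.
quant_scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.CHANNEL,
        symmetric=True,
        dynamic=False,
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=True,
        dynamic=True,
    ),
)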

This uses the quant_scheme defined earlier. With that recipe the code simply hung; after adding scheme='FP8_DYNAMIC' to the recipe it did run, but it still produced the same barely quantized model with those layers left untouched. By "barely quantized" I mean the resulting model takes up only about 10-15 GB less than the original.
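
For reference, a rough way to check how much actually got converted (a diagnostic sketch, assuming the quantized weights end up as torch.float8_e4m3fn, which is what llmcompressor's FP8 checkpoints normally use):

import collections
import torch

# Tally parameter bytes by dtype after oneshot() to see which tensors were
# converted and which were left in the original precision.
bytes_by_dtype = collections.Counter()
skipped = []
for name, param in model.named_parameters():
    bytes_by_dtype[str(param.dtype)] += param.numel() * param.element_size()
    if param.dtype in (torch.bfloat16, torch.float16) and "proj" in name:
        skipped.append(name)

for dtype, nbytes in sorted(bytes_by_dtype.items()):
    print(f"{dtype}: {nbytes / 1e9:.2f} GB")

# The expert tensors from above should show up among the non-quantized ones:
print([n for n in skipped if "experts.gate_up_proj" in n][:3])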
I would appreciate any input on this.
