Failing to quantize using your method
Hello,
Because I work in an air-gapped environment I am unable to simply download the quantized model, so I tried to follow your instructions under Creation to achieve a similar result.
Unfortunately it's been rather unsuccessful, with some bizarre issues:
It seems like most model layers do get quantized, except for the tensors named `language_model.model.layers.<layer_num>.feed_forward.experts.gate_up_proj`.
These are apparently Parameters rather than Modules of their own, and as such don't get picked up and quantized like any other layer (according to my research, but I could be wrong).
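For what it's worth, a quick inspection of the loaded model seems consistent with that (just a debugging snippet I ran; the submodule path is taken from the parameter names above):

```python
import torch.nn as nn

# Inspect the parent of one of the skipped tensors. If gate_up_proj is a plain
# nn.Parameter hanging off a custom experts module (rather than the weight of an
# nn.Linear), then a recipe targeting "Linear" would never pick it up.
experts = model.get_submodule("language_model.model.layers.0.feed_forward.experts")
print(type(experts))               # custom MoE experts module
print(type(experts.gate_up_proj))  # torch.nn.parameter.Parameter
print([type(m).__name__ for m in experts.children()])  # any nn.Linear children at all?
```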
Moreover, I initially attempted quantization by following the vLLM docs, where the following example is given:
```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# (model, tokenizer and MODEL_ID are assumed to be loaded earlier in the docs example)

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
However, the recipe in the Creation steps is a bit different:
```python
recipe = QuantizationModifier(
    targets="Linear",
    config_groups={"group_0": quant_scheme},
    ignore=[
        're:.*lm_head',
        're:.*self_attn',
        're:.*router',
        're:.*vision_model',
        're:.*multi_modal_projector',
    ],
)
```
Using this method (with the `quant_scheme` defined earlier), the code simply hung. After adding `scheme="FP8_DYNAMIC"` to the recipe it did run, but it produced the same barely quantized model with those layers left untouched. By "barely quantized" I mean the resulting model takes up only about 10-15 GB less than the original one.
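In case it's relevant, this is how I'm checking which tensors actually end up quantized in the saved checkpoint (just a debugging sketch over the saved safetensors shards, with `SAVE_DIR` as above):

```python
import glob
from collections import defaultdict
from safetensors import safe_open

# Group tensor names in the saved checkpoint by their on-disk dtype:
# weights that were actually quantized should be stored as float8, while
# anything that was skipped stays in the original bf16/fp16 precision.
by_dtype = defaultdict(list)
for shard in sorted(glob.glob(f"{SAVE_DIR}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            by_dtype[str(f.get_tensor(key).dtype)].append(key)

for dtype, names in sorted(by_dtype.items()):
    print(f"{dtype}: {len(names)} tensors, e.g. {names[:3]}")
```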
I would appreciate any input on this.