Why is the q4 model size larger than int8?

#4 opened by sainishanthvetsa

@Xenova @hf

Can you please give some clarity on why the int4-quantized models have a larger file size than the int8 models in ONNX? We observed this for Llama 3.2 1B as well as Gemma 3 1B.

ONNX Community org • edited Oct 20

Hi 👋 The main reason is that this q4 model only quantizes the MatMul weights, while Gather nodes (the embedding lookup tables) are left unquantized. This is because the current version of Transformers.js (v3.x) doesn't support 4-bit gather operations... However, starting with Transformers.js v4 (currently in developer preview), you'll be able to perform 4-bit gathers, and the weights will be significantly smaller.
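
To make the size gap concrete with back-of-envelope numbers: for Llama 3.2 1B the embedding table is roughly 128,256 (vocab) × 2,048 (hidden) ≈ 263M parameters. Left unquantized, that single tensor is about 1.05 GB in fp32 (or ~525 MB in fp16), versus ~263 MB at int8, which is more than enough to outweigh the savings from storing the MatMul weights at 4 bits instead of 8.

You can check where the bytes actually live by grouping each initializer's size by the op type that consumes it. Below is a minimal sketch in Python, assuming the `onnx` package is installed; the filename `model_q4.onnx` and the exact op names in the output (e.g. `MatMulNBits`, `Gather`) depend on how the model was exported.

```python
# Tally initializer bytes by consuming op type, to show that in a q4
# export the Gather (embedding) tables dominate the file size while
# the 4-bit MatMul weights are comparatively small.
import onnx
from onnx import TensorProto

# Illustrative filename; point this at your exported model.
model = onnx.load("model_q4.onnx", load_external_data=True)
graph = model.graph

# Map each initializer name to the op types of the nodes that consume it.
consumers = {}
for node in graph.node:
    for name in node.input:
        consumers.setdefault(name, set()).add(node.op_type)

# Sum serialized tensor sizes per (op type, element dtype).
# Note: a tensor shared by several op types is counted once per consumer.
totals = {}
for init in graph.initializer:
    dtype = TensorProto.DataType.Name(init.data_type)
    for op in consumers.get(init.name, {"<unused>"}):
        key = (op, dtype)
        totals[key] = totals.get(key, 0) + init.ByteSize()

for (op, dtype), nbytes in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{op:>16s} {dtype:>8s} {nbytes / 1e6:10.1f} MB")
```

On a q4 export you would expect the `Gather` row to report full-precision bytes and dwarf everything else, whereas in the int8 export the same tensor appears as 8-bit data at roughly a quarter (fp32) or half (fp16) of that size.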

Thank you for the response, @Xenova!

sainishanthvetsa changed discussion status to closed