Why is the q4 model size larger than int8?

#4 opened by sainishanthvetsa

@Xenova @hf

Can you please give some clarity on why the int4-quantized models have a larger file size than the int8 models in ONNX? We observed this for Llama 3.2 1B as well as Gemma 3 1B.

ONNX Community org • edited Oct 20

Hi 👋 The main reason is that this q4 model only quantizes the MatMul weights, while Gather nodes (the embedding lookup tables) are left unquantized. This is because the current version of Transformers.js (v3.x) doesn't support 4-bit gather operations... However, starting with Transformers.js v4 (currently in developer preview), you'll be able to perform 4-bit gathers, and the weights will be significantly smaller.
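
To make the size gap concrete with back-of-envelope numbers: for Llama 3.2 1B the embedding table is roughly 128,256 (vocab) × 2,048 (hidden) ≈ 263M parameters. Left unquantized, that single tensor is about 1.05 GB in fp32 (or ~525 MB in fp16), versus ~263 MB at int8, which is more than enough to outweigh the savings from storing the MatMul weights at 4 bits instead of 8.

You can check where the bytes actually live by grouping each initializer's size by the op type that consumes it. Below is a minimal sketch in Python, assuming the `onnx` package is installed; the filename `model_q4.onnx` and the exact op names in the output (e.g. `MatMulNBits`, `Gather`) depend on how the model was exported.

```python
# Tally initializer bytes by consuming op type, to show that in a q4
# export the Gather (embedding) tables dominate the file size while
# the 4-bit MatMul weights are comparatively small.
import onnx
from onnx import TensorProto

# Illustrative filename; point this at your exported model.
model = onnx.load("model_q4.onnx", load_external_data=True)
graph = model.graph

# Map each initializer name to the op types of the nodes that consume it.
consumers = {}
for node in graph.node:
    for name in node.input:
        consumers.setdefault(name, set()).add(node.op_type)

# Sum serialized tensor sizes per (op type, element dtype).
# Note: a tensor shared by several op types is counted once per consumer.
totals = {}
for init in graph.initializer:
    dtype = TensorProto.DataType.Name(init.data_type)
    for op in consumers.get(init.name, {"<unused>"}):
        key = (op, dtype)
        totals[key] = totals.get(key, 0) + init.ByteSize()

for (op, dtype), nbytes in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{op:>16s} {dtype:>8s} {nbytes / 1e6:10.1f} MB")
```

On a q4 export you would expect the `Gather` row to report full-precision bytes and dwarf everything else, whereas in the int8 export the same tensor appears as 8-bit data at roughly a quarter (fp32) or half (fp16) of that size.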

Thank you for the response, @Xenova!

sainishanthvetsa changed discussion status to closed