---
license: unknown
---

# Tensor Type Testing

> [!TIP]
> Skip to the bottom of this document for a TL;DR

For more info, see [llama.cpp #12511: Handle user-defined quantization levels for additional tensors](https://github.com/ggml-org/llama.cpp/pull/12511) by @EAddario

Testing done by @ddh0 using [this branch](https://github.com/EAddario/llama.cpp/tree/quantize) as of commit [5a304b8](https://github.com/EAddario/llama.cpp/commit/5a304b8e26b8c53f43e8d12515e52f9bb7d199f0), using libllama built for Linux with CUDA.

## Quantization naming scheme

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf`:
- Model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) is Q4_K
- TYPE_FFN (MLP / feed-forward tensors) is Q4_K
- TYPE_ATTN (K, Q, V attention and attention output tensors) is Q8_0
- TYPE_OUTPUT (output tensor) is Q8_0

---

## Command template

```bash
TYPE_EMBD=GGML_TYPE
TYPE_FFN=GGML_TYPE
TYPE_ATTN=GGML_TYPE
TYPE_OUTPUT=GGML_TYPE
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

Note the two trailing positional arguments: the base quantization type passed to `llama-quantize` (here reused from `$TYPE_FFN`) and the thread count. The `--tensor-type`, `--token-embedding-type`, and `--output-tensor-type` flags override the base type for the tensors they name.
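For scripted sweeps it can be convenient to build the same invocation programmatically. Below is a minimal Python sketch (not part of the original test harness; the helper name and example paths are illustrative) that derives the output filename from the naming scheme above and runs `llama-quantize` via `subprocess`:

```python
import subprocess
from pathlib import Path

LLAMA_QUANTIZE = "./llama.cpp/build/bin/llama-quantize"  # path to your build

def quantize(src_gguf: str, model_name: str,
             type_embd: str, type_ffn: str, type_attn: str, type_output: str,
             n_threads: int = 16) -> Path:
    """Run llama-quantize with per-tensor-type overrides; returns the output path."""
    # Destination name follows the E/F/A/O naming scheme described above
    dst = Path(src_gguf).with_name(
        f"{model_name}-E{type_embd}-F{type_ffn}-A{type_attn}-O{type_output}.gguf"
    )
    cmd = [LLAMA_QUANTIZE, "--token-embedding-type", type_embd]
    for tensor in ("ffn_down", "ffn_gate", "ffn_up"):
        cmd += ["--tensor-type", f"{tensor}={type_ffn}"]
    for tensor in ("attn_k", "attn_q", "attn_v", "attn_out"):
        cmd += ["--tensor-type", f"{tensor}={type_attn}"]
    # Positional args: source, destination, base type, thread count
    cmd += ["--output-tensor-type", type_output,
            src_gguf, str(dst), type_ffn, str(n_threads)]
    subprocess.run(cmd, check=True)
    return dst

# e.g. quantize("/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf", "Llama-3.2-3B",
#               "Q2_K", "Q8_0", "Q8_0", "Q8_0")
```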
---

## Commands used for Llama 3.2

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0

> ⚠️ **This quant was not included in the testing because Llama 3.2 3B has no separate output tensor (its output weights are tied to the token embeddings), so the resulting file is the same as a normal Q8_0.**

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
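Whether a given model has a separate output tensor can be checked before quantizing. A small sketch using the `gguf` Python package (the path is illustrative):

```python
# pip install gguf
from gguf import GGUFReader

reader = GGUFReader("/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf")
tensor_names = {t.name for t in reader.tensors}

# Models with tied embeddings (like Llama 3.2 3B) have no "output.weight";
# the token embedding matrix is reused for the output projection.
print("has separate output tensor:", "output.weight" in tensor_names)
```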
---

## Raw results for Llama 3.2 3B

```
Number of input texts: 10
Shortest input length in tokens: 55
Longest input length in tokens: 4678
Average input length in tokens: 1605.5
Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
Llama-3.2-3B-Q2_K.gguf -- 1.5045396089553833
Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.34473341703414917
Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 4.693454742431641
Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 1.5281471014022827
Llama-3.2-3B-Q8_0.gguf -- 0.0027674345765262842
--------------------------------------------------------------------------------
```
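The evaluation harness itself isn't reproduced in this document. Conceptually, each quantized model's raw logits over the ten prompts are compared against the BF16 baseline's logits. A minimal numpy sketch of that comparison, assuming each model's per-prompt logits were saved to `.npy` files (the paths and layout are illustrative, not the actual harness):

```python
import numpy as np

def mean_squared_deviation(baseline: np.ndarray, test: np.ndarray) -> float:
    """Element-wise MSD between two models' logits for the same prompt."""
    return float(np.mean((baseline.astype(np.float32) - test.astype(np.float32)) ** 2))

msds = []
for i in range(10):
    base = np.load(f"logits/Llama-3.2-3B-BF16/prompt_{i}.npy")
    test = np.load(f"logits/Llama-3.2-3B-Q2_K/prompt_{i}.npy")
    msd = mean_squared_deviation(base, test)
    print(f"-- Prompt {i}: {msd}")
    msds.append(msd)
print(f"Average MSD: {sum(msds) / len(msds)}")
```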
---

## Commands used for Qwen2.5-14B

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Qwen2.5-14B

```
Number of input texts: 10
Shortest input length in tokens: 60
Longest input length in tokens: 4801
Average input length in tokens: 1589.3
Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
Qwen2.5-14B-Q2_K.gguf -- 1.3933132886886597
Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.01663336530327797
Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 2.160086154937744
Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 0.6060687303543091
Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf -- 0.9216748476028442
Qwen2.5-14B-Q8_0.gguf -- 0.0053400141187012196
--------------------------------------------------------------------------------
```

---

## TL;DR

Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better):

| | Q2_K | Crush TYPE_EMBD | Crush TYPE_FFN | Crush TYPE_ATTN | Crush TYPE_OUTPUT | Q8_0 |
| ------------ | -------- | --------------- | -------------- | --------------- | ----------------- | ---------- |
| Llama 3.2 3B | 1.504 | 0.344 | 4.693 | 1.528 | N/A | 0.002 |
| Qwen2.5-14B | 1.393 | 0.016 | 2.160 | 0.606 | 0.921 | 0.005 |
| **Average** | **1.44** | **0.18** | **3.42** | **1.06** | **0.921** | **0.0040** |

In short: aggressive quantization of the FFN tensors causes the greatest deviation from BF16, while aggressive quantization of the token embeddings causes the least. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to some combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0, depending on your situation.
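For reference, the cross-model averages in the table follow directly from the per-model averages in the raw results; the table entries are truncated to fewer digits, and the Crush TYPE_OUTPUT column averages only the Qwen2.5-14B value, since that variant was skipped for Llama 3.2 3B. A few lines of Python reproduce them:

```python
# Per-model average MSDs, copied from the raw results above
llama32 = {"Q2_K": 1.5045396089553833, "EMBD": 0.34473341703414917,
           "FFN": 4.693454742431641, "ATTN": 1.5281471014022827,
           "Q8_0": 0.0027674345765262842}
qwen25 = {"Q2_K": 1.3933132886886597, "EMBD": 0.01663336530327797,
          "FFN": 2.160086154937744, "ATTN": 0.6060687303543091,
          "OUTPUT": 0.9216748476028442, "Q8_0": 0.0053400141187012196}

# Average each column over the models that have a value for it
for key in ("Q2_K", "EMBD", "FFN", "ATTN", "OUTPUT", "Q8_0"):
    vals = [d[key] for d in (llama32, qwen25) if key in d]
    print(f"Crush {key}: {sum(vals) / len(vals)}")
```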