Update

I've improved the quality of the model, but the size increased from 571MiB to 624MiB.

There's now only a ~1% difference in retrieval performance between this model and the full f32 model.

This model is ~6% more accurate at retrieval than the onnx-community uint8 model with f32 output.

This model is somewhere around 3.5% more accurate at retrieval than the previous version of this model.

Inference speed on my hardware (Ryzen CPU) was similar to the previous model.

Qwen3-Embedding-0.6B-onnx-uint8

This is an onnx version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

This model has been dynamically quantized to uint8, and further modified to output a uint8 1024 dim tensor.

This model is compatible with qdrant fastembed. Please note these details:

  • Execute model without pooling and without normalization
  • Pay attention to the example query format in the code below

Quantization method

I created a little onnx model instrumentation framework to assist in quantization. I generated calibration data, created an instrumented onnx model, and recorded the range of values for every tensor in the model during inference. I tested different criteria for excluding nodes until I settled on what I felt was a good size/accuracy tradeoff. I ended up excluding 484 of the most sensitive nodes from quantization.
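Recording a per-tensor value range can be sketched roughly as follows. This is a minimal illustration, not the actual instrumentation framework: `RangeTracker`, its methods, and the tensor name `example_tensor` are hypothetical, the sample values are made up, and each tensor is assumed to arrive as a flat sequence of floats captured during a calibration run.

```python
from typing import Dict, Iterable, Tuple


class RangeTracker:
    """Keeps a running (min, max) per tensor name across inference runs."""

    def __init__(self) -> None:
        self._ranges: Dict[str, Tuple[float, float]] = {}

    def update(self, name: str, values: Iterable[float]) -> None:
        """Widen the stored range for `name` with newly observed values."""
        vals = list(values)
        if not vals:
            return  # nothing observed for this tensor in this run
        lo, hi = min(vals), max(vals)
        if name in self._ranges:
            prev_lo, prev_hi = self._ranges[name]
            lo, hi = min(lo, prev_lo), max(hi, prev_hi)
        self._ranges[name] = (lo, hi)

    def ranges(self) -> Dict[str, Tuple[float, float]]:
        """Return a copy of all recorded ranges."""
        return dict(self._ranges)


# Made-up values standing in for two calibration batches.
tracker = RangeTracker()
tracker.update("example_tensor", [0.01, 4.2, 0.3])
tracker.update("example_tensor", [12.5, 0.002])
print(tracker.ranges())
```

Statistics like these could then feed whatever criterion is used to decide which nodes to exclude; the criterion itself is not shown here.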

After that I generated 1 million tokens of calibration data and recorded the range of float32 outputs seen during inference.

The range I found: -0.3009805381298065 to 0.3952634334564209

I used that range for an asymmetric linear quantization from float32 -> uint8.
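As a rough illustration of what an asymmetric linear mapping over that range looks like, here is a minimal pure-Python sketch. The scale and zero-point convention and the rounding mode are assumptions, since the exact ones baked into the model are not stated here, and the function names are hypothetical.

```python
from typing import List, Sequence

# Range reported above for the float32 outputs.
F32_MIN = -0.3009805381298065
F32_MAX = 0.3952634334564209

# Assumed convention: the full range maps onto the 256 uint8 codes,
# and ZERO_POINT is the code that represents 0.0.
SCALE = (F32_MAX - F32_MIN) / 255.0
ZERO_POINT = round(-F32_MIN / SCALE)


def quantize(values: Sequence[float]) -> List[int]:
    """Map floats to uint8 codes, clipping anything outside the range."""
    codes = []
    for v in values:
        q = round(v / SCALE) + ZERO_POINT
        codes.append(max(0, min(255, q)))
    return codes


def dequantize(codes: Sequence[int]) -> List[float]:
    """Map uint8 codes back to approximate float values."""
    return [(q - ZERO_POINT) * SCALE for q in codes]


sample = [F32_MIN, 0.0, 0.1, F32_MAX]
codes = quantize(sample)
print(codes)
print(dequantize(codes))
```

Because the range is not centered on zero, the zero point lands away from the middle of the uint8 range, which is what makes the mapping asymmetric. Values outside the calibrated range are clipped to 0 or 255.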

Here are the nodes I excluded:
["/0/auto_model/ConstantOfShape",
"/0/auto_model/Constant_28",
"/0/auto_model/layers.25/post_attention_layernorm/Pow",
"/0/auto_model/layers.26/input_layernorm/Pow",
"/0/auto_model/layers.25/input_layernorm/Pow",
"/0/auto_model/layers.24/post_attention_layernorm/Pow",
"/0/auto_model/layers.24/input_layernorm/Pow",
"/0/auto_model/layers.23/post_attention_layernorm/Pow",
"/0/auto_model/layers.23/input_layernorm/Pow",
"/0/auto_model/layers.22/post_attention_layernorm/Pow",
"/0/auto_model/layers.22/input_layernorm/Pow",
"/0/auto_model/layers.3/input_layernorm/Pow",
"/0/auto_model/layers.4/input_layernorm/Pow",
"/0/auto_model/layers.3/post_attention_layernorm/Pow",
"/0/auto_model/layers.21/post_attention_layernorm/Pow",
"/0/auto_model/layers.5/input_layernorm/Pow",
"/0/auto_model/layers.4/post_attention_layernorm/Pow",
"/0/auto_model/layers.5/post_attention_layernorm/Pow",
"/0/auto_model/layers.6/input_layernorm/Pow",
"/0/auto_model/layers.6/post_attention_layernorm/Pow",
"/0/auto_model/layers.7/input_layernorm/Pow",
"/0/auto_model/layers.8/input_layernorm/Pow",
"/0/auto_model/layers.7/post_attention_layernorm/Pow",
"/0/auto_model/layers.26/post_attention_layernorm/Pow",
"/0/auto_model/layers.9/input_layernorm/Pow",
"/0/auto_model/layers.8/post_attention_layernorm/Pow",
"/0/auto_model/layers.21/input_layernorm/Pow",
"/0/auto_model/layers.20/post_attention_layernorm/Pow",
"/0/auto_model/layers.9/post_attention_layernorm/Pow",
"/0/auto_model/layers.10/input_layernorm/Pow",
"/0/auto_model/layers.20/input_layernorm/Pow",
"/0/auto_model/layers.11/input_layernorm/Pow",
"/0/auto_model/layers.10/post_attention_layernorm/Pow",
"/0/auto_model/layers.12/input_layernorm/Pow",
"/0/auto_model/layers.11/post_attention_layernorm/Pow",
"/0/auto_model/layers.12/post_attention_layernorm/Pow",
"/0/auto_model/layers.13/input_layernorm/Pow",
"/0/auto_model/layers.19/post_attention_layernorm/Pow",
"/0/auto_model/layers.13/post_attention_layernorm/Pow",
"/0/auto_model/layers.14/input_layernorm/Pow",
"/0/auto_model/layers.19/input_layernorm/Pow",
"/0/auto_model/layers.18/post_attention_layernorm/Pow",
"/0/auto_model/layers.14/post_attention_layernorm/Pow",
"/0/auto_model/layers.15/input_layernorm/Pow",
"/0/auto_model/layers.16/input_layernorm/Pow",
"/0/auto_model/layers.15/post_attention_layernorm/Pow",
"/0/auto_model/layers.18/input_layernorm/Pow",
"/0/auto_model/layers.17/post_attention_layernorm/Pow",
"/0/auto_model/layers.17/input_layernorm/Pow",
"/0/auto_model/layers.16/post_attention_layernorm/Pow",
"/0/auto_model/layers.27/post_attention_layernorm/Pow",
"/0/auto_model/layers.27/input_layernorm/Pow",
"/0/auto_model/norm/Pow",
"/0/auto_model/layers.25/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.25/post_attention_layernorm/Add",
"/0/auto_model/layers.26/input_layernorm/Add",
"/0/auto_model/layers.26/input_layernorm/ReduceMean",
"/0/auto_model/layers.25/input_layernorm/ReduceMean",
"/0/auto_model/layers.25/input_layernorm/Add",
"/0/auto_model/layers.24/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.24/post_attention_layernorm/Add",
"/0/auto_model/layers.24/input_layernorm/Add",
"/0/auto_model/layers.24/input_layernorm/ReduceMean",
"/0/auto_model/layers.23/post_attention_layernorm/Add",
"/0/auto_model/layers.23/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.23/input_layernorm/ReduceMean",
"/0/auto_model/layers.23/input_layernorm/Add",
"/0/auto_model/layers.22/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.22/post_attention_layernorm/Add",
"/0/auto_model/layers.26/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.26/post_attention_layernorm/Add",
"/0/auto_model/layers.22/input_layernorm/ReduceMean",
"/0/auto_model/layers.22/input_layernorm/Add",
"/0/auto_model/layers.3/input_layernorm/Add",
"/0/auto_model/layers.3/input_layernorm/ReduceMean",
"/0/auto_model/layers.21/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.21/post_attention_layernorm/Add",
"/0/auto_model/layers.4/input_layernorm/Add",
"/0/auto_model/layers.4/input_layernorm/ReduceMean",
"/0/auto_model/layers.3/post_attention_layernorm/Add",
"/0/auto_model/layers.3/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.5/input_layernorm/Add",
"/0/auto_model/layers.5/input_layernorm/ReduceMean",
"/0/auto_model/layers.4/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.4/post_attention_layernorm/Add",
"/0/auto_model/layers.5/post_attention_layernorm/Add",
"/0/auto_model/layers.5/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.6/input_layernorm/Add",
"/0/auto_model/layers.6/input_layernorm/ReduceMean",
"/0/auto_model/layers.6/post_attention_layernorm/Add",
"/0/auto_model/layers.6/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.7/input_layernorm/Add",
"/0/auto_model/layers.7/input_layernorm/ReduceMean",
"/0/auto_model/layers.8/input_layernorm/ReduceMean",
"/0/auto_model/layers.8/input_layernorm/Add",
"/0/auto_model/layers.7/post_attention_layernorm/Add",
"/0/auto_model/layers.7/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.9/input_layernorm/Add",
"/0/auto_model/layers.9/input_layernorm/ReduceMean",
"/0/auto_model/layers.8/post_attention_layernorm/Add",
"/0/auto_model/layers.8/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.21/input_layernorm/Add",
"/0/auto_model/layers.21/input_layernorm/ReduceMean",
"/0/auto_model/layers.20/post_attention_layernorm/Add",
"/0/auto_model/layers.20/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.9/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.9/post_attention_layernorm/Add",
"/0/auto_model/layers.10/input_layernorm/ReduceMean",
"/0/auto_model/layers.10/input_layernorm/Add",
"/0/auto_model/layers.20/input_layernorm/Add",
"/0/auto_model/layers.20/input_layernorm/ReduceMean",
"/0/auto_model/layers.11/input_layernorm/ReduceMean",
"/0/auto_model/layers.11/input_layernorm/Add",
"/0/auto_model/layers.10/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.10/post_attention_layernorm/Add",
"/0/auto_model/layers.12/input_layernorm/ReduceMean",
"/0/auto_model/layers.12/input_layernorm/Add",
"/0/auto_model/layers.11/post_attention_layernorm/Add",
"/0/auto_model/layers.11/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.12/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.12/post_attention_layernorm/Add",
"/0/auto_model/layers.13/input_layernorm/Add",
"/0/auto_model/layers.13/input_layernorm/ReduceMean",
"/0/auto_model/layers.19/post_attention_layernorm/Add",
"/0/auto_model/layers.19/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.13/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.13/post_attention_layernorm/Add",
"/0/auto_model/layers.14/input_layernorm/Add",
"/0/auto_model/layers.14/input_layernorm/ReduceMean",
"/0/auto_model/layers.19/input_layernorm/ReduceMean",
"/0/auto_model/layers.19/input_layernorm/Add",
"/0/auto_model/layers.18/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.18/post_attention_layernorm/Add",
"/0/auto_model/layers.14/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.14/post_attention_layernorm/Add",
"/0/auto_model/layers.15/input_layernorm/ReduceMean",
"/0/auto_model/layers.15/input_layernorm/Add",
"/0/auto_model/layers.16/input_layernorm/Add",
"/0/auto_model/layers.16/input_layernorm/ReduceMean",
"/0/auto_model/layers.15/post_attention_layernorm/Add",
"/0/auto_model/layers.15/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.18/input_layernorm/Add",
"/0/auto_model/layers.18/input_layernorm/ReduceMean",
"/0/auto_model/layers.17/post_attention_layernorm/Add",
"/0/auto_model/layers.17/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.17/input_layernorm/ReduceMean",
"/0/auto_model/layers.17/input_layernorm/Add",
"/0/auto_model/layers.16/post_attention_layernorm/Add",
"/0/auto_model/layers.16/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.27/post_attention_layernorm/Add",
"/0/auto_model/layers.27/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.27/input_layernorm/Add",
"/0/auto_model/layers.27/input_layernorm/ReduceMean",
"/0/auto_model/layers.27/self_attn/q_norm/Pow",
"/0/auto_model/layers.14/self_attn/k_norm/Pow",
"/0/auto_model/layers.26/self_attn/q_norm/Pow",
"/0/auto_model/layers.25/self_attn/q_norm/Pow",
"/0/auto_model/layers.26/self_attn/k_norm/Pow",
"/0/auto_model/layers.8/self_attn/k_norm/Pow",
"/0/auto_model/layers.24/self_attn/k_norm/Pow",
"/0/auto_model/layers.24/self_attn/q_norm/Pow",
"/0/auto_model/layers.25/self_attn/k_norm/Pow",
"/0/auto_model/layers.23/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/self_attn/k_norm/Pow",
"/0/auto_model/layers.12/self_attn/k_norm/Pow",
"/0/auto_model/layers.13/self_attn/k_norm/Pow",
"/0/auto_model/layers.2/mlp/down_proj/MatMul",
"/0/auto_model/layers.3/post_attention_layernorm/Cast",
"/0/auto_model/layers.3/Add",
"/0/auto_model/layers.3/Add_1",
"/0/auto_model/layers.4/input_layernorm/Cast",
"/0/auto_model/layers.3/input_layernorm/Cast",
"/0/auto_model/layers.2/Add_1",
"/0/auto_model/layers.4/Add",
"/0/auto_model/layers.4/post_attention_layernorm/Cast",
"/0/auto_model/layers.5/input_layernorm/Cast",
"/0/auto_model/layers.4/Add_1",
"/0/auto_model/layers.5/post_attention_layernorm/Cast",
"/0/auto_model/layers.5/Add",
"/0/auto_model/layers.5/Add_1",
"/0/auto_model/layers.6/input_layernorm/Cast",
"/0/auto_model/layers.7/Add_1",
"/0/auto_model/layers.8/input_layernorm/Cast",
"/0/auto_model/layers.7/Add",
"/0/auto_model/layers.7/post_attention_layernorm/Cast",
"/0/auto_model/layers.6/Add",
"/0/auto_model/layers.6/post_attention_layernorm/Cast",
"/0/auto_model/layers.6/Add_1",
"/0/auto_model/layers.7/input_layernorm/Cast",
"/0/auto_model/layers.8/Add",
"/0/auto_model/layers.8/post_attention_layernorm/Cast",
"/0/auto_model/layers.9/input_layernorm/Cast",
"/0/auto_model/layers.8/Add_1",
"/0/auto_model/layers.9/post_attention_layernorm/Cast",
"/0/auto_model/layers.9/Add",
"/0/auto_model/layers.9/Add_1",
"/0/auto_model/layers.10/input_layernorm/Cast",
"/0/auto_model/layers.11/input_layernorm/Cast",
"/0/auto_model/layers.10/Add_1",
"/0/auto_model/layers.10/Add",
"/0/auto_model/layers.10/post_attention_layernorm/Cast",
"/0/auto_model/layers.11/Add",
"/0/auto_model/layers.11/post_attention_layernorm/Cast",
"/0/auto_model/layers.11/Add_1",
"/0/auto_model/layers.12/input_layernorm/Cast",
"/0/auto_model/layers.12/Add",
"/0/auto_model/layers.12/post_attention_layernorm/Cast",
"/0/auto_model/layers.12/Add_1",
"/0/auto_model/layers.13/input_layernorm/Cast",
"/0/auto_model/layers.13/Add",
"/0/auto_model/layers.13/post_attention_layernorm/Cast",
"/0/auto_model/layers.14/input_layernorm/Cast",
"/0/auto_model/layers.13/Add_1",
"/0/auto_model/layers.14/Add_1",
"/0/auto_model/layers.15/input_layernorm/Cast",
"/0/auto_model/layers.14/post_attention_layernorm/Cast",
"/0/auto_model/layers.14/Add",
"/0/auto_model/layers.15/post_attention_layernorm/Cast",
"/0/auto_model/layers.15/Add_1",
"/0/auto_model/layers.16/input_layernorm/Cast",
"/0/auto_model/layers.15/Add",
"/0/auto_model/layers.17/input_layernorm/Cast",
"/0/auto_model/layers.16/Add_1",
"/0/auto_model/layers.16/Add",
"/0/auto_model/layers.16/post_attention_layernorm/Cast",
"/0/auto_model/layers.19/input_layernorm/Cast",
"/0/auto_model/layers.18/Add_1",
"/0/auto_model/layers.18/input_layernorm/Cast",
"/0/auto_model/layers.17/Add_1",
"/0/auto_model/layers.17/Add",
"/0/auto_model/layers.17/post_attention_layernorm/Cast",
"/0/auto_model/layers.18/post_attention_layernorm/Cast",
"/0/auto_model/layers.18/Add",
"/0/auto_model/layers.19/Add",
"/0/auto_model/layers.19/post_attention_layernorm/Cast",
"/0/auto_model/layers.22/Add_1",
"/0/auto_model/layers.23/input_layernorm/Cast",
"/0/auto_model/layers.20/Add_1",
"/0/auto_model/layers.21/input_layernorm/Cast",
"/0/auto_model/layers.21/Add_1",
"/0/auto_model/layers.22/input_layernorm/Cast",
"/0/auto_model/layers.19/Add_1",
"/0/auto_model/layers.20/input_layernorm/Cast",
"/0/auto_model/layers.24/input_layernorm/Cast",
"/0/auto_model/layers.23/Add_1",
"/0/auto_model/layers.22/Add",
"/0/auto_model/layers.22/post_attention_layernorm/Cast",
"/0/auto_model/layers.21/Add",
"/0/auto_model/layers.21/post_attention_layernorm/Cast",
"/0/auto_model/layers.20/Add",
"/0/auto_model/layers.20/post_attention_layernorm/Cast",
"/0/auto_model/layers.23/post_attention_layernorm/Cast",
"/0/auto_model/layers.23/Add",
"/0/auto_model/layers.25/input_layernorm/Cast",
"/0/auto_model/layers.24/Add_1",
"/0/auto_model/layers.24/post_attention_layernorm/Cast",
"/0/auto_model/layers.24/Add",
"/0/auto_model/layers.25/Add",
"/0/auto_model/layers.25/post_attention_layernorm/Cast",
"/0/auto_model/layers.25/Add_1",
"/0/auto_model/layers.26/input_layernorm/Cast",
"/0/auto_model/layers.26/Add",
"/0/auto_model/layers.26/post_attention_layernorm/Cast",
"/0/auto_model/layers.21/self_attn/q_norm/Pow",
"/0/auto_model/layers.26/Add_1",
"/0/auto_model/layers.27/input_layernorm/Cast",
"/0/auto_model/layers.27/Add",
"/0/auto_model/layers.27/post_attention_layernorm/Cast",
"/0/auto_model/norm/Add",
"/0/auto_model/norm/ReduceMean",
"/0/auto_model/layers.23/self_attn/k_norm/Pow",
"/0/auto_model/layers.21/self_attn/k_norm/Pow",
"/0/auto_model/layers.22/self_attn/k_norm/Pow",
"/0/auto_model/layers.10/self_attn/k_norm/Pow",
"/0/auto_model/layers.19/self_attn/q_norm/Pow",
"/0/auto_model/layers.2/mlp/Mul",
"/0/auto_model/layers.22/self_attn/q_norm/Pow",
"/0/auto_model/layers.11/self_attn/k_norm/Pow",
"/0/auto_model/layers.20/self_attn/q_norm/Pow",
"/0/auto_model/layers.20/self_attn/k_norm/Pow",
"/0/auto_model/layers.18/self_attn/q_norm/Pow",
"/0/auto_model/layers.17/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/mlp/down_proj/MatMul",
"/0/auto_model/layers.19/self_attn/k_norm/Pow",
"/0/auto_model/layers.27/Add_1",
"/0/auto_model/norm/Cast",
"/0/auto_model/layers.16/self_attn/k_norm/Pow",
"/0/auto_model/layers.18/self_attn/k_norm/Pow",
"/0/auto_model/layers.11/self_attn/q_norm/Pow",
"/0/auto_model/layers.9/self_attn/q_norm/Pow",
"/0/auto_model/layers.26/self_attn/q_norm/Add",
"/0/auto_model/layers.26/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.14/self_attn/k_norm/Add",
"/0/auto_model/layers.14/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.16/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/mlp/Mul",
"/0/auto_model/layers.27/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.27/self_attn/q_norm/Add",
"/0/auto_model/layers.9/self_attn/k_norm/Pow",
"/0/auto_model/layers.17/self_attn/k_norm/Pow",
"/0/auto_model/layers.26/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.26/self_attn/k_norm/Add",
"/0/auto_model/layers.25/self_attn/k_norm/Add",
"/0/auto_model/layers.25/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.13/self_attn/k_norm/Add",
"/0/auto_model/layers.13/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.10/self_attn/q_norm/Pow",
"/0/auto_model/layers.25/input_layernorm/Mul_1",
"/0/auto_model/layers.27/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.27/self_attn/k_norm/Add",
"/0/auto_model/layers.26/input_layernorm/Mul_1",
"/0/auto_model/layers.15/self_attn/q_norm/Pow",
"/0/auto_model/layers.12/self_attn/k_norm/Add",
"/0/auto_model/layers.12/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.25/self_attn/q_norm/Add",
"/0/auto_model/layers.25/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.24/input_layernorm/Mul_1",
"/0/auto_model/layers.12/self_attn/q_norm/Pow",
"/0/auto_model/layers.24/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.24/self_attn/q_norm/Add",
"/0/auto_model/layers.24/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.24/self_attn/k_norm/Add",
"/0/auto_model/layers.22/mlp/Mul",
"/0/auto_model/layers.2/post_attention_layernorm/Pow",
"/0/auto_model/layers.23/mlp/Mul",
"/0/auto_model/layers.24/mlp/Mul",
"/0/auto_model/layers.23/input_layernorm/Mul_1",
"/0/auto_model/layers.14/self_attn/q_norm/Pow",
"/0/auto_model/layers.14/self_attn/k_proj/MatMul",
"/0/auto_model/layers.14/self_attn/k_norm/Cast",
"/0/auto_model/layers.14/self_attn/Reshape_1",
"/0/auto_model/layers.21/mlp/Mul",
"/0/auto_model/layers.3/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.3/input_layernorm/Sqrt",
"/0/auto_model/layers.4/input_layernorm/Sqrt",
"/0/auto_model/layers.5/input_layernorm/Sqrt",
"/0/auto_model/layers.4/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.5/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.6/input_layernorm/Sqrt",
"/0/auto_model/layers.6/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.8/input_layernorm/Sqrt",
"/0/auto_model/layers.8/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.7/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.7/input_layernorm/Sqrt",
"/0/auto_model/layers.9/input_layernorm/Sqrt",
"/0/auto_model/layers.10/input_layernorm/Sqrt",
"/0/auto_model/layers.9/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.11/input_layernorm/Sqrt",
"/0/auto_model/layers.10/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.12/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.11/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.12/input_layernorm/Sqrt",
"/0/auto_model/layers.13/input_layernorm/Sqrt",
"/0/auto_model/layers.14/input_layernorm/Sqrt",
"/0/auto_model/layers.13/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.15/input_layernorm/Sqrt",
"/0/auto_model/layers.14/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.16/input_layernorm/Sqrt",
"/0/auto_model/layers.15/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.17/input_layernorm/Sqrt",
"/0/auto_model/layers.16/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.19/input_layernorm/Sqrt",
"/0/auto_model/layers.17/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.18/input_layernorm/Sqrt",
"/0/auto_model/layers.18/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.19/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.23/input_layernorm/Sqrt",
"/0/auto_model/layers.20/input_layernorm/Sqrt",
"/0/auto_model/layers.21/input_layernorm/Sqrt",
"/0/auto_model/layers.22/input_layernorm/Sqrt",
"/0/auto_model/layers.22/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.24/input_layernorm/Sqrt",
"/0/auto_model/layers.20/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.21/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.23/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.25/input_layernorm/Sqrt",
"/0/auto_model/layers.24/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.25/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.26/input_layernorm/Sqrt",
"/0/auto_model/layers.26/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.15/self_attn/k_norm/Pow",
"/0/auto_model/layers.27/input_layernorm/Sqrt",
"/0/auto_model/layers.27/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.2/input_layernorm/Pow",
"/0/auto_model/layers.26/mlp/Mul",
"/0/auto_model/layers.23/self_attn/q_norm/Add",
"/0/auto_model/layers.23/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.13/self_attn/q_norm/Pow",
"/0/auto_model/layers.21/self_attn/q_norm/Add",
"/0/auto_model/layers.21/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.6/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/self_attn/Reshape_7",
"/0/auto_model/layers.27/self_attn/MatMul_1",
"/0/auto_model/layers.27/self_attn/Transpose_4",
"/0/auto_model/layers.26/self_attn/Expand_1",
"/0/auto_model/layers.26/self_attn/Unsqueeze_19",
"/0/auto_model/layers.26/self_attn/v_proj/MatMul",
"/0/auto_model/layers.26/self_attn/Transpose_2",
"/0/auto_model/layers.26/self_attn/Reshape_6",
"/0/auto_model/layers.26/self_attn/Reshape_2",
"/0/auto_model/layers.11/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.11/self_attn/k_norm/Add",
"/0/auto_model/layers.22/input_layernorm/Mul_1",
"/0/auto_model/layers.25/mlp/Mul",
"/0/auto_model/layers.8/self_attn/k_norm/Cast",
"/0/auto_model/layers.8/self_attn/k_proj/MatMul",
"/0/auto_model/layers.8/self_attn/Reshape_1",
"/0/auto_model/layers.21/input_layernorm/Mul_1",
"/0/auto_model/layers.5/self_attn/q_norm/Pow",
"/0/auto_model/layers.22/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.22/self_attn/q_norm/Add",
"/0/auto_model/layers.22/mlp/down_proj/MatMul",
"/0/auto_model/layers.23/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.23/self_attn/k_norm/Add",
"/0/auto_model/layers.23/mlp/down_proj/MatMul",
"/0/auto_model/layers.26/mlp/down_proj/MatMul",
"/0/auto_model/layers.1/self_attn/Add_2",
"/0/auto_model/layers.2/self_attn/Add_2",
"/0/auto_model/layers.6/self_attn/Add_2",
"/0/auto_model/layers.11/self_attn/Add_2",
"/0/auto_model/layers.12/self_attn/Add_2",
"/0/auto_model/layers.16/self_attn/Add_2",
"/0/auto_model/layers.21/self_attn/Add_2",
"/0/auto_model/layers.24/self_attn/Add_2",
"/0/auto_model/layers.0/self_attn/Add_2",
"/0/auto_model/layers.8/self_attn/Add_2",
"/0/auto_model/layers.13/self_attn/Add_2",
"/0/auto_model/layers.26/self_attn/Add_2",
"/0/auto_model/layers.3/self_attn/Add_2",
"/0/auto_model/layers.15/self_attn/Add_2",
"/0/auto_model/layers.25/self_attn/Add_2",
"/0/auto_model/layers.4/self_attn/Add_2",
"/0/auto_model/layers.14/self_attn/Add_2",
"/0/auto_model/layers.22/self_attn/Add_2",
"/0/auto_model/layers.9/self_attn/Add_2",
"/0/auto_model/layers.23/self_attn/Add_2",
"/0/auto_model/layers.10/self_attn/Add_2",
"/0/auto_model/layers.5/self_attn/Add_2",
"/0/auto_model/layers.19/self_attn/Add_2",
"/0/auto_model/layers.7/self_attn/Add_2",
"/0/auto_model/layers.27/self_attn/Add_2",
"/0/auto_model/layers.18/self_attn/Add_2",
"/0/auto_model/layers.20/self_attn/Add_2",
"/0/auto_model/layers.17/self_attn/Add_2",
"/0/auto_model/Slice_1",
"/0/auto_model/layers.5/self_attn/Slice_4",
"/0/auto_model/layers.12/self_attn/Slice_4",
"/0/auto_model/layers.18/self_attn/Slice_4",
"/0/auto_model/layers.3/self_attn/Slice_4",
"/0/auto_model/layers.11/self_attn/Slice_4",
"/0/auto_model/layers.22/self_attn/Slice_4",
"/0/auto_model/Expand",
"/0/auto_model/layers.4/self_attn/Slice_4",
"/0/auto_model/Slice_2",
"/0/auto_model/layers.8/self_attn/Slice_4",
"/0/auto_model/layers.2/self_attn/Slice_4",
"/0/auto_model/layers.15/self_attn/Slice_4",
"/0/auto_model/layers.26/self_attn/Slice_4",
"/0/auto_model/layers.24/self_attn/Slice_4",
"/0/auto_model/Expand_1",
"/0/auto_model/layers.14/self_attn/Slice_4",
"/0/auto_model/layers.21/self_attn/Slice_4",
"/0/auto_model/layers.1/self_attn/Slice_4",
"/0/auto_model/Reshape_2",
"/0/auto_model/layers.19/self_attn/Slice_4",
"/0/auto_model/Slice",
"/0/auto_model/layers.6/self_attn/Slice_4",
"/0/auto_model/layers.0/self_attn/Slice_4",
"/0/auto_model/layers.25/self_attn/Slice_4",
"/0/auto_model/Unsqueeze_4",
"/0/auto_model/layers.10/self_attn/Slice_4",
"/0/auto_model/layers.23/self_attn/Slice_4",
"/0/auto_model/layers.17/self_attn/Slice_4",
"/0/auto_model/Where_1",
"/0/auto_model/layers.27/self_attn/Slice_4",
"/0/auto_model/layers.20/self_attn/Slice_4",
"/0/auto_model/Add",
"/0/auto_model/Mul",
"/0/auto_model/layers.7/self_attn/Slice_4",
"/0/auto_model/layers.13/self_attn/Slice_4",
"/0/auto_model/layers.9/self_attn/Slice_4",
"/0/auto_model/layers.16/self_attn/Slice_4",
"/0/auto_model/Unsqueeze_3",
"/0/auto_model/ScatterND"]

Benchmarks

Speed

Method = Big chunk of text x10 runs

Seconds elapsed for dynamic_int4.onnx: 45.37 (mixed int4/uint8 quantization)

Seconds elapsed for opt_f32.onnx: 46.07 (base f32 model preprocessed for quantization)

Seconds elapsed for dynamic_uint8.onnx: 34.61 (this model)

Verdict: This model is about 25% faster on my CPU compared to the base model.
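The arithmetic behind that figure can be checked with a short sketch using the timings listed above. The helper names are hypothetical, and "faster" is read here as less wall-clock time for the same workload.

```python
# Timings (seconds for 10 runs) copied from the list above.
BASE_SECONDS = 46.07   # opt_f32.onnx
UINT8_SECONDS = 34.61  # dynamic_uint8.onnx (this model)


def time_saved_fraction(base: float, new: float) -> float:
    """Fraction of wall-clock time saved relative to the base model."""
    return 1.0 - new / base


def throughput_ratio(base: float, new: float) -> float:
    """How many times more work gets done per unit of time."""
    return base / new


saved = time_saved_fraction(BASE_SECONDS, UINT8_SECONDS)
ratio = throughput_ratio(BASE_SECONDS, UINT8_SECONDS)
print(f"time saved: {saved:.1%}")        # roughly a quarter less time
print(f"throughput ratio: {ratio:.2f}x")  # roughly a third more throughput
```

The two numbers describe the same measurement from different angles, so it helps to state which one is meant when quoting a speedup.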

Accuracy

I used beir-qdrant with the scifact dataset.

This retrieval benchmark isn't the greatest result.

I welcome any additional benchmarks from the community. Please feel free to share any further results.

If someone wants to sponsor me with an NVIDIA GPU I can have a much faster turnaround time with my model experiments and explore some different quantization strategies.

onnx f32 model with f32 output (baseline):

ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301}
recall: {'Recall@1': 0.53828, 'Recall@3': 0.71517, 'Recall@5': 0.77883, 'Recall@10': 0.83056, 'Recall@100': 0.95333, 'Recall@1000': 0.99667}
precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113}

onnx dynamic uint8 model with f32 output (previous model's parent):

ndcg: {'NDCG@1': 0.52333, 'NDCG@3': 0.58087, 'NDCG@5': 0.59811, 'NDCG@10': 0.6249, 'NDCG@100': 0.66025, 'NDCG@1000': 0.67023}
recall: {'Recall@1': 0.4965, 'Recall@3': 0.62211, 'Recall@5': 0.66622, 'Recall@10': 0.74478, 'Recall@100': 0.90333, 'Recall@1000': 0.98}
precision: {'P@1': 0.52333, 'P@3': 0.22889, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}

onnx dynamic uint8 model with uint8 output (previous model):

Note: This benchmarking better than its parent is actually bad. I used more calibration data in the current version to avoid a repeat.

ndcg: {'NDCG@1': 0.52667, 'NDCG@3': 0.58478, 'NDCG@5': 0.60006, 'NDCG@10': 0.62646, 'NDCG@100': 0.66175, 'NDCG@1000': 0.67171}
recall: {'Recall@1': 0.49983, 'Recall@3': 0.62711, 'Recall@5': 0.66706, 'Recall@10': 0.74478, 'Recall@100': 0.90333, 'Recall@1000': 0.98}
precision: {'P@1': 0.52667, 'P@3': 0.23111, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}

onnx dynamic uint8 model with f32 output (this model's parent):

ndcg: {'NDCG@1': 0.56, 'NDCG@3': 0.63242, 'NDCG@5': 0.66258, 'NDCG@10': 0.68893, 'NDCG@100': 0.71276, 'NDCG@1000': 0.72}
recall: {'Recall@1': 0.53094, 'Recall@3': 0.68117, 'Recall@5': 0.75417, 'Recall@10': 0.83256, 'Recall@100': 0.94, 'Recall@1000': 0.99667}
precision: {'P@1': 0.56, 'P@3': 0.24778, 'P@5': 0.16867, 'P@10': 0.094, 'P@100': 0.0107, 'P@1000': 0.00113}

onnx dynamic uint8 model with uint8 output (this model):

ndcg: {'NDCG@1': 0.56, 'NDCG@3': 0.63119, 'NDCG@5': 0.66314, 'NDCG@10': 0.68867, 'NDCG@100': 0.71236, 'NDCG@1000': 0.7201}
recall: {'Recall@1': 0.53094, 'Recall@3': 0.67783, 'Recall@5': 0.75583, 'Recall@10': 0.83089, 'Recall@100': 0.93667, 'Recall@1000': 0.99667}
precision: {'P@1': 0.56, 'P@3': 0.24667, 'P@5': 0.16867, 'P@10': 0.094, 'P@100': 0.01067, 'P@1000': 0.00113}
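One way to compare these runs is the relative change in a single metric. The sketch below uses NDCG@10 from the tables above; the choice of metric and the helper name are assumptions, and which metric the summary percentages at the top are based on isn't stated, so the printed values need not match them.

```python
# NDCG@10 values copied from the result tables above.
NDCG_AT_10 = {
    "f32 baseline": 0.69999,
    "previous uint8 model": 0.62646,
    "this model": 0.68867,
}


def relative_change(value: float, reference: float) -> float:
    """Relative difference of `value` with respect to `reference`."""
    return (value - reference) / reference


current = NDCG_AT_10["this model"]
for name in ("f32 baseline", "previous uint8 model"):
    change = relative_change(current, NDCG_AT_10[name])
    print(f"this model vs {name}: {change:+.2%}")
```

Swapping in Recall@10 or another cutoff from the same tables gives a different picture, so it is worth looking at more than one metric.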

Example inference/benchmark code and how to use the model with Fastembed

After installing beir-qdrant, make sure to upgrade fastembed.

# pip install qdrant_client beir-qdrant
# pip install -U fastembed
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from qdrant_client import QdrantClient
from qdrant_client.models import Datatype
from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter
from beir_qdrant.retrieval.search.dense import DenseQdrantSearch

TextEmbedding.add_custom_model(
    model="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8"),
    dim=1024,
    model_file="dynamic_uint8.onnx",
)

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# IMPORTANT: USE THIS (OR A SIMILAR) QUERY FORMAT WITH THIS MODEL:
for k in queries.keys():
    queries[k] = (
        f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {queries[k]}"
    )

qdrant_client = QdrantClient("http://localhost:6333")

model = DenseQdrantSearch(
    qdrant_client,
    # Use the same name that was registered with add_custom_model above.
    model=DenseFastEmbedModelAdapter(model_name="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8"),
    collection_name="scifact-qwen3-uint8",
    initialize=True,
    datatype=Datatype.UINT8,
)

retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}")