electroglyph/Qwen3-Embedding-0.6B-onnx-uint8

@@ -1,3 +1,15 @@
 # Qwen3-Embedding-0.6B-onnx-uint8
 This is an onnx version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
@@ -9,6 +21,508 @@ This model is compatible with qdrant fastembed, please note these details:
 - Execute model without pooling and without normalization
 - Pay attention to the example query format in the code below
 # Benchmarks
 I used beir-qdrant with the scifact dataset.
@@ -19,13 +533,8 @@ I welcome any additional benchmarks by the community, please feel free to share
 If someone wants to sponsor me with an NVIDIA GPU I can have a much faster turnaround time with my model experiments and explore some different quantization strategies.
-edit: I've done pretty extensive testing, including comparing benchmarks to:
-https://huggingface.co/onnx-community/Qwen3-Embedding-0.6B-ONNX/blob/main/onnx/model_uint8.onnx
-and haven't been able to surpass this initial model.
-onnx f32 model with f32 output:
 ```
 ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301}
@@ -33,7 +542,7 @@ recall: {'Recall@1': 0.53828, 'Recall@3': 0.71517, 'Recall@5': 0.77883, 'Recall@
 precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113}
 ```
-onnx dynamic uint8 model with f32 output:
 ```
 ndcg: {'NDCG@1': 0.52333, 'NDCG@3': 0.58087, 'NDCG@5': 0.59811, 'NDCG@10': 0.6249, 'NDCG@100': 0.66025, 'NDCG@1000': 0.67023}
@@ -41,7 +550,9 @@ recall: {'Recall@1': 0.4965, 'Recall@3': 0.62211, 'Recall@5': 0.66622, 'Recall@1
 precision: {'P@1': 0.52333, 'P@3': 0.22889, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
 ```
-onnx dynamic uint8 model with uint8 output (this model):
 ```
 ndcg: {'NDCG@1': 0.52667, 'NDCG@3': 0.58478, 'NDCG@5': 0.60006, 'NDCG@10': 0.62646, 'NDCG@100': 0.66175, 'NDCG@1000': 0.67171}
@@ -49,6 +560,22 @@ recall: {'Recall@1': 0.49983, 'Recall@3': 0.62711, 'Recall@5': 0.66706, 'Recall@
 precision: {'P@1': 0.52667, 'P@3': 0.23111, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
 ```
 # Example inference/benchmark code and how to use the model with Fastembed
 After installing beir-qdrant make sure to upgrade fastembed.

+# Update
+I've improved the quality of the model, but size increased from 571MiB to 624MiB.
+There's now only a ~1% difference in retrieval performance between this model and the full f32 model.
+This model is ~6% more accurate at retrieval than the onnx-community uint8 model with f32 output.
+This model is somewhere around 3.5% more accurate at retrieval than the previous version of this model.
+Inference speed on my hardware (Ryzen CPU) was the same as with the previous model.
 # Qwen3-Embedding-0.6B-onnx-uint8
 This is an onnx version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
 - Execute model without pooling and without normalization
 - Pay attention to the example query format in the code below
+# Quantization method
+I created a little onnx model instrumentation framework to assist in quantization. I generated calibration data, created an instrumented onnx model, and recorded the range of values for every tensor in the model during inference. I tested different criteria for excluding nodes until I settled on what I felt was a good size/accuracy tradeoff. I ended up excluding 484 of the most sensitive nodes from quantization.
+After that I generated 1 million tokens of calibration data and recorded the range of float32 outputs seen during inference.
+The range I found: -0.3009805381298065 to 0.3952634334564209
+I used that range for an asymmetric linear quantization from float32 -> uint8.
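The range recording and the float32 -> uint8 mapping described above can be sketched roughly as follows. This is a minimal illustration and not the author's code: the `RangeObserver` class, the helper names, the two-batch replay, and the rounding/clamping choices are all assumptions. Only the two range endpoints come from the text.

```python
class RangeObserver:
    """Tracks the smallest and largest value seen across calibration batches."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))


def quant_params(lo, hi, levels=256):
    """Return (scale, zero_point) mapping [lo, hi] onto the integers 0..levels-1."""
    scale = (hi - lo) / (levels - 1)
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point):
    """float -> uint8 code; out-of-range inputs saturate at 0 or 255."""
    return [min(255, max(0, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """uint8 code -> approximate float."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical replay of the calibration run: two made-up batches whose
# extremes are the range endpoints reported above.
observer = RangeObserver()
observer.observe([-0.3009805381298065, 0.01])
observer.observe([0.2, 0.3952634334564209])

scale, zero_point = quant_params(observer.lo, observer.hi)
codes = quantize([observer.lo, 0.0, observer.hi], scale, zero_point)
print(scale, zero_point, codes)
```

With the reported range the zero point lands at 110 rather than 128, which is what makes the mapping asymmetric: the negative side of the range gets fewer codes than the positive side.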
+<details>
+  <summary>Here are the nodes I excluded</summary>
+```python
+["/0/auto_model/ConstantOfShape",
+"/0/auto_model/Constant_28",
+"/0/auto_model/layers.25/post_attention_layernorm/Pow",
+"/0/auto_model/layers.26/input_layernorm/Pow",
+"/0/auto_model/layers.25/input_layernorm/Pow",
+"/0/auto_model/layers.24/post_attention_layernorm/Pow",
+"/0/auto_model/layers.24/input_layernorm/Pow",
+"/0/auto_model/layers.23/post_attention_layernorm/Pow",
+"/0/auto_model/layers.23/input_layernorm/Pow",
+"/0/auto_model/layers.22/post_attention_layernorm/Pow",
+"/0/auto_model/layers.22/input_layernorm/Pow",
+"/0/auto_model/layers.3/input_layernorm/Pow",
+"/0/auto_model/layers.4/input_layernorm/Pow",
+"/0/auto_model/layers.3/post_attention_layernorm/Pow",
+"/0/auto_model/layers.21/post_attention_layernorm/Pow",
+"/0/auto_model/layers.5/input_layernorm/Pow",
+"/0/auto_model/layers.4/post_attention_layernorm/Pow",
+"/0/auto_model/layers.5/post_attention_layernorm/Pow",
+"/0/auto_model/layers.6/input_layernorm/Pow",
+"/0/auto_model/layers.6/post_attention_layernorm/Pow",
+"/0/auto_model/layers.7/input_layernorm/Pow",
+"/0/auto_model/layers.8/input_layernorm/Pow",
+"/0/auto_model/layers.7/post_attention_layernorm/Pow",
+"/0/auto_model/layers.26/post_attention_layernorm/Pow",
+"/0/auto_model/layers.9/input_layernorm/Pow",
+"/0/auto_model/layers.8/post_attention_layernorm/Pow",
+"/0/auto_model/layers.21/input_layernorm/Pow",
+"/0/auto_model/layers.20/post_attention_layernorm/Pow",
+"/0/auto_model/layers.9/post_attention_layernorm/Pow",
+"/0/auto_model/layers.10/input_layernorm/Pow",
+"/0/auto_model/layers.20/input_layernorm/Pow",
+"/0/auto_model/layers.11/input_layernorm/Pow",
+"/0/auto_model/layers.10/post_attention_layernorm/Pow",
+"/0/auto_model/layers.12/input_layernorm/Pow",
+"/0/auto_model/layers.11/post_attention_layernorm/Pow",
+"/0/auto_model/layers.12/post_attention_layernorm/Pow",
+"/0/auto_model/layers.13/input_layernorm/Pow",
+"/0/auto_model/layers.19/post_attention_layernorm/Pow",
+"/0/auto_model/layers.13/post_attention_layernorm/Pow",
+"/0/auto_model/layers.14/input_layernorm/Pow",
+"/0/auto_model/layers.19/input_layernorm/Pow",
+"/0/auto_model/layers.18/post_attention_layernorm/Pow",
+"/0/auto_model/layers.14/post_attention_layernorm/Pow",
+"/0/auto_model/layers.15/input_layernorm/Pow",
+"/0/auto_model/layers.16/input_layernorm/Pow",
+"/0/auto_model/layers.15/post_attention_layernorm/Pow",
+"/0/auto_model/layers.18/input_layernorm/Pow",
+"/0/auto_model/layers.17/post_attention_layernorm/Pow",
+"/0/auto_model/layers.17/input_layernorm/Pow",
+"/0/auto_model/layers.16/post_attention_layernorm/Pow",
+"/0/auto_model/layers.27/post_attention_layernorm/Pow",
+"/0/auto_model/layers.27/input_layernorm/Pow",
+"/0/auto_model/norm/Pow",
+"/0/auto_model/layers.25/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.25/post_attention_layernorm/Add",
+"/0/auto_model/layers.26/input_layernorm/Add",
+"/0/auto_model/layers.26/input_layernorm/ReduceMean",
+"/0/auto_model/layers.25/input_layernorm/ReduceMean",
+"/0/auto_model/layers.25/input_layernorm/Add",
+"/0/auto_model/layers.24/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.24/post_attention_layernorm/Add",
+"/0/auto_model/layers.24/input_layernorm/Add",
+"/0/auto_model/layers.24/input_layernorm/ReduceMean",
+"/0/auto_model/layers.23/post_attention_layernorm/Add",
+"/0/auto_model/layers.23/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.23/input_layernorm/ReduceMean",
+"/0/auto_model/layers.23/input_layernorm/Add",
+"/0/auto_model/layers.22/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.22/post_attention_layernorm/Add",
+"/0/auto_model/layers.26/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.26/post_attention_layernorm/Add",
+"/0/auto_model/layers.22/input_layernorm/ReduceMean",
+"/0/auto_model/layers.22/input_layernorm/Add",
+"/0/auto_model/layers.3/input_layernorm/Add",
+"/0/auto_model/layers.3/input_layernorm/ReduceMean",
+"/0/auto_model/layers.21/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.21/post_attention_layernorm/Add",
+"/0/auto_model/layers.4/input_layernorm/Add",
+"/0/auto_model/layers.4/input_layernorm/ReduceMean",
+"/0/auto_model/layers.3/post_attention_layernorm/Add",
+"/0/auto_model/layers.3/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.5/input_layernorm/Add",
+"/0/auto_model/layers.5/input_layernorm/ReduceMean",
+"/0/auto_model/layers.4/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.4/post_attention_layernorm/Add",
+"/0/auto_model/layers.5/post_attention_layernorm/Add",
+"/0/auto_model/layers.5/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.6/input_layernorm/Add",
+"/0/auto_model/layers.6/input_layernorm/ReduceMean",
+"/0/auto_model/layers.6/post_attention_layernorm/Add",
+"/0/auto_model/layers.6/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.7/input_layernorm/Add",
+"/0/auto_model/layers.7/input_layernorm/ReduceMean",
+"/0/auto_model/layers.8/input_layernorm/ReduceMean",
+"/0/auto_model/layers.8/input_layernorm/Add",
+"/0/auto_model/layers.7/post_attention_layernorm/Add",
+"/0/auto_model/layers.7/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.9/input_layernorm/Add",
+"/0/auto_model/layers.9/input_layernorm/ReduceMean",
+"/0/auto_model/layers.8/post_attention_layernorm/Add",
+"/0/auto_model/layers.8/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.21/input_layernorm/Add",
+"/0/auto_model/layers.21/input_layernorm/ReduceMean",
+"/0/auto_model/layers.20/post_attention_layernorm/Add",
+"/0/auto_model/layers.20/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.9/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.9/post_attention_layernorm/Add",
+"/0/auto_model/layers.10/input_layernorm/ReduceMean",
+"/0/auto_model/layers.10/input_layernorm/Add",
+"/0/auto_model/layers.20/input_layernorm/Add",
+"/0/auto_model/layers.20/input_layernorm/ReduceMean",
+"/0/auto_model/layers.11/input_layernorm/ReduceMean",
+"/0/auto_model/layers.11/input_layernorm/Add",
+"/0/auto_model/layers.10/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.10/post_attention_layernorm/Add",
+"/0/auto_model/layers.12/input_layernorm/ReduceMean",
+"/0/auto_model/layers.12/input_layernorm/Add",
+"/0/auto_model/layers.11/post_attention_layernorm/Add",
+"/0/auto_model/layers.11/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.12/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.12/post_attention_layernorm/Add",
+"/0/auto_model/layers.13/input_layernorm/Add",
+"/0/auto_model/layers.13/input_layernorm/ReduceMean",
+"/0/auto_model/layers.19/post_attention_layernorm/Add",
+"/0/auto_model/layers.19/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.13/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.13/post_attention_layernorm/Add",
+"/0/auto_model/layers.14/input_layernorm/Add",
+"/0/auto_model/layers.14/input_layernorm/ReduceMean",
+"/0/auto_model/layers.19/input_layernorm/ReduceMean",
+"/0/auto_model/layers.19/input_layernorm/Add",
+"/0/auto_model/layers.18/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.18/post_attention_layernorm/Add",
+"/0/auto_model/layers.14/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.14/post_attention_layernorm/Add",
+"/0/auto_model/layers.15/input_layernorm/ReduceMean",
+"/0/auto_model/layers.15/input_layernorm/Add",
+"/0/auto_model/layers.16/input_layernorm/Add",
+"/0/auto_model/layers.16/input_layernorm/ReduceMean",
+"/0/auto_model/layers.15/post_attention_layernorm/Add",
+"/0/auto_model/layers.15/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.18/input_layernorm/Add",
+"/0/auto_model/layers.18/input_layernorm/ReduceMean",
+"/0/auto_model/layers.17/post_attention_layernorm/Add",
+"/0/auto_model/layers.17/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.17/input_layernorm/ReduceMean",
+"/0/auto_model/layers.17/input_layernorm/Add",
+"/0/auto_model/layers.16/post_attention_layernorm/Add",
+"/0/auto_model/layers.16/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.27/post_attention_layernorm/Add",
+"/0/auto_model/layers.27/post_attention_layernorm/ReduceMean",
+"/0/auto_model/layers.27/input_layernorm/Add",
+"/0/auto_model/layers.27/input_layernorm/ReduceMean",
+"/0/auto_model/layers.27/self_attn/q_norm/Pow",
+"/0/auto_model/layers.14/self_attn/k_norm/Pow",
+"/0/auto_model/layers.26/self_attn/q_norm/Pow",
+"/0/auto_model/layers.25/self_attn/q_norm/Pow",
+"/0/auto_model/layers.26/self_attn/k_norm/Pow",
+"/0/auto_model/layers.8/self_attn/k_norm/Pow",
+"/0/auto_model/layers.24/self_attn/k_norm/Pow",
+"/0/auto_model/layers.24/self_attn/q_norm/Pow",
+"/0/auto_model/layers.25/self_attn/k_norm/Pow",
+"/0/auto_model/layers.23/self_attn/q_norm/Pow",
+"/0/auto_model/layers.27/self_attn/k_norm/Pow",
+"/0/auto_model/layers.12/self_attn/k_norm/Pow",
+"/0/auto_model/layers.13/self_attn/k_norm/Pow",
+"/0/auto_model/layers.2/mlp/down_proj/MatMul",
+"/0/auto_model/layers.3/post_attention_layernorm/Cast",
+"/0/auto_model/layers.3/Add",
+"/0/auto_model/layers.3/Add_1",
+"/0/auto_model/layers.4/input_layernorm/Cast",
+"/0/auto_model/layers.3/input_layernorm/Cast",
+"/0/auto_model/layers.2/Add_1",
+"/0/auto_model/layers.4/Add",
+"/0/auto_model/layers.4/post_attention_layernorm/Cast",
+"/0/auto_model/layers.5/input_layernorm/Cast",
+"/0/auto_model/layers.4/Add_1",
+"/0/auto_model/layers.5/post_attention_layernorm/Cast",
+"/0/auto_model/layers.5/Add",
+"/0/auto_model/layers.5/Add_1",
+"/0/auto_model/layers.6/input_layernorm/Cast",
+"/0/auto_model/layers.7/Add_1",
+"/0/auto_model/layers.8/input_layernorm/Cast",
+"/0/auto_model/layers.7/Add",
+"/0/auto_model/layers.7/post_attention_layernorm/Cast",
+"/0/auto_model/layers.6/Add",
+"/0/auto_model/layers.6/post_attention_layernorm/Cast",
+"/0/auto_model/layers.6/Add_1",
+"/0/auto_model/layers.7/input_layernorm/Cast",
+"/0/auto_model/layers.8/Add",
+"/0/auto_model/layers.8/post_attention_layernorm/Cast",
+"/0/auto_model/layers.9/input_layernorm/Cast",
+"/0/auto_model/layers.8/Add_1",
+"/0/auto_model/layers.9/post_attention_layernorm/Cast",
+"/0/auto_model/layers.9/Add",
+"/0/auto_model/layers.9/Add_1",
+"/0/auto_model/layers.10/input_layernorm/Cast",
+"/0/auto_model/layers.11/input_layernorm/Cast",
+"/0/auto_model/layers.10/Add_1",
+"/0/auto_model/layers.10/Add",
+"/0/auto_model/layers.10/post_attention_layernorm/Cast",
+"/0/auto_model/layers.11/Add",
+"/0/auto_model/layers.11/post_attention_layernorm/Cast",
+"/0/auto_model/layers.11/Add_1",
+"/0/auto_model/layers.12/input_layernorm/Cast",
+"/0/auto_model/layers.12/Add",
+"/0/auto_model/layers.12/post_attention_layernorm/Cast",
+"/0/auto_model/layers.12/Add_1",
+"/0/auto_model/layers.13/input_layernorm/Cast",
+"/0/auto_model/layers.13/Add",
+"/0/auto_model/layers.13/post_attention_layernorm/Cast",
+"/0/auto_model/layers.14/input_layernorm/Cast",
+"/0/auto_model/layers.13/Add_1",
+"/0/auto_model/layers.14/Add_1",
+"/0/auto_model/layers.15/input_layernorm/Cast",
+"/0/auto_model/layers.14/post_attention_layernorm/Cast",
+"/0/auto_model/layers.14/Add",
+"/0/auto_model/layers.15/post_attention_layernorm/Cast",
+"/0/auto_model/layers.15/Add_1",
+"/0/auto_model/layers.16/input_layernorm/Cast",
+"/0/auto_model/layers.15/Add",
+"/0/auto_model/layers.17/input_layernorm/Cast",
+"/0/auto_model/layers.16/Add_1",
+"/0/auto_model/layers.16/Add",
+"/0/auto_model/layers.16/post_attention_layernorm/Cast",
+"/0/auto_model/layers.19/input_layernorm/Cast",
+"/0/auto_model/layers.18/Add_1",
+"/0/auto_model/layers.18/input_layernorm/Cast",
+"/0/auto_model/layers.17/Add_1",
+"/0/auto_model/layers.17/Add",
+"/0/auto_model/layers.17/post_attention_layernorm/Cast",
+"/0/auto_model/layers.18/post_attention_layernorm/Cast",
+"/0/auto_model/layers.18/Add",
+"/0/auto_model/layers.19/Add",
+"/0/auto_model/layers.19/post_attention_layernorm/Cast",
+"/0/auto_model/layers.22/Add_1",
+"/0/auto_model/layers.23/input_layernorm/Cast",
+"/0/auto_model/layers.20/Add_1",
+"/0/auto_model/layers.21/input_layernorm/Cast",
+"/0/auto_model/layers.21/Add_1",
+"/0/auto_model/layers.22/input_layernorm/Cast",
+"/0/auto_model/layers.19/Add_1",
+"/0/auto_model/layers.20/input_layernorm/Cast",
+"/0/auto_model/layers.24/input_layernorm/Cast",
+"/0/auto_model/layers.23/Add_1",
+"/0/auto_model/layers.22/Add",
+"/0/auto_model/layers.22/post_attention_layernorm/Cast",
+"/0/auto_model/layers.21/Add",
+"/0/auto_model/layers.21/post_attention_layernorm/Cast",
+"/0/auto_model/layers.20/Add",
+"/0/auto_model/layers.20/post_attention_layernorm/Cast",
+"/0/auto_model/layers.23/post_attention_layernorm/Cast",
+"/0/auto_model/layers.23/Add",
+"/0/auto_model/layers.25/input_layernorm/Cast",
+"/0/auto_model/layers.24/Add_1",
+"/0/auto_model/layers.24/post_attention_layernorm/Cast",
+"/0/auto_model/layers.24/Add",
+"/0/auto_model/layers.25/Add",
+"/0/auto_model/layers.25/post_attention_layernorm/Cast",
+"/0/auto_model/layers.25/Add_1",
+"/0/auto_model/layers.26/input_layernorm/Cast",
+"/0/auto_model/layers.26/Add",
+"/0/auto_model/layers.26/post_attention_layernorm/Cast",
+"/0/auto_model/layers.21/self_attn/q_norm/Pow",
+"/0/auto_model/layers.26/Add_1",
+"/0/auto_model/layers.27/input_layernorm/Cast",
+"/0/auto_model/layers.27/Add",
+"/0/auto_model/layers.27/post_attention_layernorm/Cast",
+"/0/auto_model/norm/Add",
+"/0/auto_model/norm/ReduceMean",
+"/0/auto_model/layers.23/self_attn/k_norm/Pow",
+"/0/auto_model/layers.21/self_attn/k_norm/Pow",
+"/0/auto_model/layers.22/self_attn/k_norm/Pow",
+"/0/auto_model/layers.10/self_attn/k_norm/Pow",
+"/0/auto_model/layers.19/self_attn/q_norm/Pow",
+"/0/auto_model/layers.2/mlp/Mul",
+"/0/auto_model/layers.22/self_attn/q_norm/Pow",
+"/0/auto_model/layers.11/self_attn/k_norm/Pow",
+"/0/auto_model/layers.20/self_attn/q_norm/Pow",
+"/0/auto_model/layers.20/self_attn/k_norm/Pow",
+"/0/auto_model/layers.18/self_attn/q_norm/Pow",
+"/0/auto_model/layers.17/self_attn/q_norm/Pow",
+"/0/auto_model/layers.27/mlp/down_proj/MatMul",
+"/0/auto_model/layers.19/self_attn/k_norm/Pow",
+"/0/auto_model/layers.27/Add_1",
+"/0/auto_model/norm/Cast",
+"/0/auto_model/layers.16/self_attn/k_norm/Pow",
+"/0/auto_model/layers.18/self_attn/k_norm/Pow",
+"/0/auto_model/layers.11/self_attn/q_norm/Pow",
+"/0/auto_model/layers.9/self_attn/q_norm/Pow",
+"/0/auto_model/layers.26/self_attn/q_norm/Add",
+"/0/auto_model/layers.26/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.14/self_attn/k_norm/Add",
+"/0/auto_model/layers.14/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.16/self_attn/q_norm/Pow",
+"/0/auto_model/layers.27/mlp/Mul",
+"/0/auto_model/layers.27/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.27/self_attn/q_norm/Add",
+"/0/auto_model/layers.9/self_attn/k_norm/Pow",
+"/0/auto_model/layers.17/self_attn/k_norm/Pow",
+"/0/auto_model/layers.26/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.26/self_attn/k_norm/Add",
+"/0/auto_model/layers.25/self_attn/k_norm/Add",
+"/0/auto_model/layers.25/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.13/self_attn/k_norm/Add",
+"/0/auto_model/layers.13/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.10/self_attn/q_norm/Pow",
+"/0/auto_model/layers.25/input_layernorm/Mul_1",
+"/0/auto_model/layers.27/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.27/self_attn/k_norm/Add",
+"/0/auto_model/layers.26/input_layernorm/Mul_1",
+"/0/auto_model/layers.15/self_attn/q_norm/Pow",
+"/0/auto_model/layers.12/self_attn/k_norm/Add",
+"/0/auto_model/layers.12/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.25/self_attn/q_norm/Add",
+"/0/auto_model/layers.25/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.24/input_layernorm/Mul_1",
+"/0/auto_model/layers.12/self_attn/q_norm/Pow",
+"/0/auto_model/layers.24/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.24/self_attn/q_norm/Add",
+"/0/auto_model/layers.24/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.24/self_attn/k_norm/Add",
+"/0/auto_model/layers.22/mlp/Mul",
+"/0/auto_model/layers.2/post_attention_layernorm/Pow",
+"/0/auto_model/layers.23/mlp/Mul",
+"/0/auto_model/layers.24/mlp/Mul",
+"/0/auto_model/layers.23/input_layernorm/Mul_1",
+"/0/auto_model/layers.14/self_attn/q_norm/Pow",
+"/0/auto_model/layers.14/self_attn/k_proj/MatMul",
+"/0/auto_model/layers.14/self_attn/k_norm/Cast",
+"/0/auto_model/layers.14/self_attn/Reshape_1",
+"/0/auto_model/layers.21/mlp/Mul",
+"/0/auto_model/layers.3/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.3/input_layernorm/Sqrt",
+"/0/auto_model/layers.4/input_layernorm/Sqrt",
+"/0/auto_model/layers.5/input_layernorm/Sqrt",
+"/0/auto_model/layers.4/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.5/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.6/input_layernorm/Sqrt",
+"/0/auto_model/layers.6/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.8/input_layernorm/Sqrt",
+"/0/auto_model/layers.8/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.7/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.7/input_layernorm/Sqrt",
+"/0/auto_model/layers.9/input_layernorm/Sqrt",
+"/0/auto_model/layers.10/input_layernorm/Sqrt",
+"/0/auto_model/layers.9/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.11/input_layernorm/Sqrt",
+"/0/auto_model/layers.10/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.12/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.11/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.12/input_layernorm/Sqrt",
+"/0/auto_model/layers.13/input_layernorm/Sqrt",
+"/0/auto_model/layers.14/input_layernorm/Sqrt",
+"/0/auto_model/layers.13/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.15/input_layernorm/Sqrt",
+"/0/auto_model/layers.14/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.16/input_layernorm/Sqrt",
+"/0/auto_model/layers.15/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.17/input_layernorm/Sqrt",
+"/0/auto_model/layers.16/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.19/input_layernorm/Sqrt",
+"/0/auto_model/layers.17/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.18/input_layernorm/Sqrt",
+"/0/auto_model/layers.18/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.19/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.23/input_layernorm/Sqrt",
+"/0/auto_model/layers.20/input_layernorm/Sqrt",
+"/0/auto_model/layers.21/input_layernorm/Sqrt",
+"/0/auto_model/layers.22/input_layernorm/Sqrt",
+"/0/auto_model/layers.22/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.24/input_layernorm/Sqrt",
+"/0/auto_model/layers.20/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.21/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.23/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.25/input_layernorm/Sqrt",
+"/0/auto_model/layers.24/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.25/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.26/input_layernorm/Sqrt",
+"/0/auto_model/layers.26/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.15/self_attn/k_norm/Pow",
+"/0/auto_model/layers.27/input_layernorm/Sqrt",
+"/0/auto_model/layers.27/post_attention_layernorm/Sqrt",
+"/0/auto_model/layers.2/input_layernorm/Pow",
+"/0/auto_model/layers.26/mlp/Mul",
+"/0/auto_model/layers.23/self_attn/q_norm/Add",
+"/0/auto_model/layers.23/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.13/self_attn/q_norm/Pow",
+"/0/auto_model/layers.21/self_attn/q_norm/Add",
+"/0/auto_model/layers.21/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.6/self_attn/q_norm/Pow",
+"/0/auto_model/layers.27/self_attn/Reshape_7",
+"/0/auto_model/layers.27/self_attn/MatMul_1",
+"/0/auto_model/layers.27/self_attn/Transpose_4",
+"/0/auto_model/layers.26/self_attn/Expand_1",
+"/0/auto_model/layers.26/self_attn/Unsqueeze_19",
+"/0/auto_model/layers.26/self_attn/v_proj/MatMul",
+"/0/auto_model/layers.26/self_attn/Transpose_2",
+"/0/auto_model/layers.26/self_attn/Reshape_6",
+"/0/auto_model/layers.26/self_attn/Reshape_2",
+"/0/auto_model/layers.11/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.11/self_attn/k_norm/Add",
+"/0/auto_model/layers.22/input_layernorm/Mul_1",
+"/0/auto_model/layers.25/mlp/Mul",
+"/0/auto_model/layers.8/self_attn/k_norm/Cast",
+"/0/auto_model/layers.8/self_attn/k_proj/MatMul",
+"/0/auto_model/layers.8/self_attn/Reshape_1",
+"/0/auto_model/layers.21/input_layernorm/Mul_1",
+"/0/auto_model/layers.5/self_attn/q_norm/Pow",
+"/0/auto_model/layers.22/self_attn/q_norm/ReduceMean",
+"/0/auto_model/layers.22/self_attn/q_norm/Add",
+"/0/auto_model/layers.22/mlp/down_proj/MatMul",
+"/0/auto_model/layers.23/self_attn/k_norm/ReduceMean",
+"/0/auto_model/layers.23/self_attn/k_norm/Add",
+"/0/auto_model/layers.23/mlp/down_proj/MatMul",
+"/0/auto_model/layers.26/mlp/down_proj/MatMul",
+"/0/auto_model/layers.1/self_attn/Add_2",
+"/0/auto_model/layers.2/self_attn/Add_2",
+"/0/auto_model/layers.6/self_attn/Add_2",
+"/0/auto_model/layers.11/self_attn/Add_2",
+"/0/auto_model/layers.12/self_attn/Add_2",
+"/0/auto_model/layers.16/self_attn/Add_2",
+"/0/auto_model/layers.21/self_attn/Add_2",
+"/0/auto_model/layers.24/self_attn/Add_2",
+"/0/auto_model/layers.0/self_attn/Add_2",
+"/0/auto_model/layers.8/self_attn/Add_2",
+"/0/auto_model/layers.13/self_attn/Add_2",
+"/0/auto_model/layers.26/self_attn/Add_2",
+"/0/auto_model/layers.3/self_attn/Add_2",
+"/0/auto_model/layers.15/self_attn/Add_2",
+"/0/auto_model/layers.25/self_attn/Add_2",
+"/0/auto_model/layers.4/self_attn/Add_2",
+"/0/auto_model/layers.14/self_attn/Add_2",
+"/0/auto_model/layers.22/self_attn/Add_2",
+"/0/auto_model/layers.9/self_attn/Add_2",
+"/0/auto_model/layers.23/self_attn/Add_2",
+"/0/auto_model/layers.10/self_attn/Add_2",
+"/0/auto_model/layers.5/self_attn/Add_2",
+"/0/auto_model/layers.19/self_attn/Add_2",
+"/0/auto_model/layers.7/self_attn/Add_2",
+"/0/auto_model/layers.27/self_attn/Add_2",
+"/0/auto_model/layers.18/self_attn/Add_2",
+"/0/auto_model/layers.20/self_attn/Add_2",
+"/0/auto_model/layers.17/self_attn/Add_2",
+"/0/auto_model/Slice_1",
+"/0/auto_model/layers.5/self_attn/Slice_4",
+"/0/auto_model/layers.12/self_attn/Slice_4",
+"/0/auto_model/layers.18/self_attn/Slice_4",
+"/0/auto_model/layers.3/self_attn/Slice_4",
+"/0/auto_model/layers.11/self_attn/Slice_4",
+"/0/auto_model/layers.22/self_attn/Slice_4",
+"/0/auto_model/Expand",
+"/0/auto_model/layers.4/self_attn/Slice_4",
+"/0/auto_model/Slice_2",
+"/0/auto_model/layers.8/self_attn/Slice_4",
+"/0/auto_model/layers.2/self_attn/Slice_4",
+"/0/auto_model/layers.15/self_attn/Slice_4",
+"/0/auto_model/layers.26/self_attn/Slice_4",
+"/0/auto_model/layers.24/self_attn/Slice_4",
+"/0/auto_model/Expand_1",
+"/0/auto_model/layers.14/self_attn/Slice_4",
+"/0/auto_model/layers.21/self_attn/Slice_4",
+"/0/auto_model/layers.1/self_attn/Slice_4",
+"/0/auto_model/Reshape_2",
+"/0/auto_model/layers.19/self_attn/Slice_4",
+"/0/auto_model/Slice",
+"/0/auto_model/layers.6/self_attn/Slice_4",
+"/0/auto_model/layers.0/self_attn/Slice_4",
+"/0/auto_model/layers.25/self_attn/Slice_4",
+"/0/auto_model/Unsqueeze_4",
+"/0/auto_model/layers.10/self_attn/Slice_4",
+"/0/auto_model/layers.23/self_attn/Slice_4",
+"/0/auto_model/layers.17/self_attn/Slice_4",
+"/0/auto_model/Where_1",
+"/0/auto_model/layers.27/self_attn/Slice_4",
+"/0/auto_model/layers.20/self_attn/Slice_4",
+"/0/auto_model/Add",
+"/0/auto_model/Mul",
+"/0/auto_model/layers.7/self_attn/Slice_4",
+"/0/auto_model/layers.13/self_attn/Slice_4",
+"/0/auto_model/layers.9/self_attn/Slice_4",
+"/0/auto_model/layers.16/self_attn/Slice_4",
+"/0/auto_model/Unsqueeze_3",
+"/0/auto_model/ScatterND"]
+```
+</details>
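One way a list like this could be applied is through the `nodes_to_exclude` argument of onnxruntime's `quantize_dynamic`. Whether the author used exactly this call is an assumption, and the file paths in the commented-out line are hypothetical. The `summarize_ops` helper is also illustrative: it only shows which op types dominate the exclusions.

```python
from collections import Counter


def summarize_ops(node_names):
    """Count excluded nodes by op, e.g. '/0/auto_model/layers.3/Add_1' -> 'Add'."""
    ops = (name.rsplit("/", 1)[-1].split("_")[0] for name in node_names)
    return Counter(ops)


def quantize_with_exclusions(src_path, dst_path, excluded_nodes):
    """Dynamically quantize weights to uint8, skipping the excluded nodes."""
    # Imported here so summarize_ops works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src_path,
        model_output=dst_path,
        weight_type=QuantType.QUInt8,
        nodes_to_exclude=excluded_nodes,
    )


# Three entries copied from the list above; the full list has 484.
sample = [
    "/0/auto_model/layers.3/Add_1",
    "/0/auto_model/norm/Pow",
    "/0/auto_model/layers.2/mlp/down_proj/MatMul",
]
print(summarize_ops(sample))

# Hypothetical paths, shown only to illustrate the call:
# quantize_with_exclusions("model_f32.onnx", "dynamic_uint8.onnx", sample)
```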
 # Benchmarks
 I used beir-qdrant with the scifact dataset.
 If someone wants to sponsor me with an NVIDIA GPU I can have a much faster turnaround time with my model experiments and explore some different quantization strategies.
+onnx f32 model with f32 output (baseline):
 ```
 ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301}
 precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113}
 ```
+onnx dynamic uint8 model with f32 output (previous model's parent):
 ```
 ndcg: {'NDCG@1': 0.52333, 'NDCG@3': 0.58087, 'NDCG@5': 0.59811, 'NDCG@10': 0.6249, 'NDCG@100': 0.66025, 'NDCG@1000': 0.67023}
 precision: {'P@1': 0.52333, 'P@3': 0.22889, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
 ```
+onnx dynamic uint8 model with uint8 output (previous model):
+Note: this model benchmarking better than its parent is actually a bad sign. I used more calibration data in the current version to avoid a repeat.
 ```
 ndcg: {'NDCG@1': 0.52667, 'NDCG@3': 0.58478, 'NDCG@5': 0.60006, 'NDCG@10': 0.62646, 'NDCG@100': 0.66175, 'NDCG@1000': 0.67171}
 precision: {'P@1': 0.52667, 'P@3': 0.23111, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
 ```
+onnx dynamic uint8 model with f32 output (this model's parent):
+```
+ndcg: {'NDCG@1': 0.56, 'NDCG@3': 0.63242, 'NDCG@5': 0.66258, 'NDCG@10': 0.68893, 'NDCG@100': 0.71276, 'NDCG@1000': 0.72}
+recall: {'Recall@1': 0.53094, 'Recall@3': 0.68117, 'Recall@5': 0.75417, 'Recall@10': 0.83256, 'Recall@100': 0.94, 'Recall@1000': 0.99667}
+precision: {'P@1': 0.56, 'P@3': 0.24778, 'P@5': 0.16867, 'P@10': 0.094, 'P@100': 0.0107, 'P@1000': 0.00113}
+```
+onnx dynamic uint8 model with uint8 output (this model):
+```
+ndcg: {'NDCG@1': 0.56, 'NDCG@3': 0.63119, 'NDCG@5': 0.66314, 'NDCG@10': 0.68867, 'NDCG@100': 0.71236, 'NDCG@1000': 0.7201}
+recall: {'Recall@1': 0.53094, 'Recall@3': 0.67783, 'Recall@5': 0.75583, 'Recall@10': 0.83089, 'Recall@100': 0.93667, 'Recall@1000': 0.99667}
+precision: {'P@1': 0.56, 'P@3': 0.24667, 'P@5': 0.16867, 'P@10': 0.094, 'P@100': 0.01067, 'P@1000': 0.00113}
+```
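To compare the runs above on a single number, the gaps can be computed directly from the reported scores. The sketch below uses NDCG@10, which is an assumption: the text does not say which metric the percentages in the update notes are based on. The dictionary keys are shortened labels for the benchmark headings.

```python
# NDCG@10 values copied from the benchmark blocks above.
NDCG10 = {
    "f32 baseline": 0.69999,
    "previous parent (f32 output)": 0.6249,
    "previous model (uint8 output)": 0.62646,
    "this model's parent (f32 output)": 0.68893,
    "this model (uint8 output)": 0.68867,
}


def gap(model, reference, scores=NDCG10):
    """Return (absolute gap in points, relative gap in percent) of model vs reference."""
    diff = scores[model] - scores[reference]
    return diff * 100, diff / scores[reference] * 100


# Compare the current model against every other run.
for reference in list(NDCG10)[:-1]:
    points, relative = gap("this model (uint8 output)", reference)
    print(f"vs {reference}: {points:+.2f} points, {relative:+.2f}% relative")
```

On this metric the current model sits about 1.1 points below the f32 baseline and about 6.2 points above the previous version.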
 # Example inference/benchmark code and how to use the model with Fastembed
 After installing beir-qdrant make sure to upgrade fastembed.

dynamic_uint8.onnx CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:058fecc253545b8281064f0709afdd409808fea84eaab1b149596f243cfc3da4
-size 599507984

 version https://git-lfs.github.com/spec/v1
+oid sha256:66b8032f385d841b909ec3712a6996e230fe23e548620ca0b41d6d391469c2b0
+size 654930391