electroglyph committed on
Commit
33ffbeb
·
verified ·
1 Parent(s): 2dc9879

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +535 -8
  2. dynamic_uint8.onnx +2 -2
README.md CHANGED
@@ -1,3 +1,15 @@
1
  # Qwen3-Embedding-0.6B-onnx-uint8
2
 
3
  This is an onnx version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
@@ -9,6 +21,508 @@ This model is compatible with qdrant fastembed, please note these details:
9
  - Execute model without pooling and without normalization
10
  - Pay attention to the example query format in the code below
11
 
12
  # Benchmarks
13
 
14
  I used beir-qdrant with the scifact dataset.
@@ -19,13 +533,8 @@ I welcome any additional benchmarks by the community, please feel free to share
19
 
20
  If someone wants to sponsor me with an NVIDIA GPU I can have a much faster turnaround time with my model experiments and explore some different quantization strategies.
21
 
22
- edit: I've done pretty extensive testing, including comparing benchmarks to:
23
 
24
- https://huggingface.co/onnx-community/Qwen3-Embedding-0.6B-ONNX/blob/main/onnx/model_uint8.onnx
25
-
26
- and haven't been able to surpass this initial model.
27
-
28
- onnx f32 model with f32 output:
29
 
30
  ```
31
  ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301}
@@ -33,7 +542,7 @@ recall: {'Recall@1': 0.53828, 'Recall@3': 0.71517, 'Recall@5': 0.77883, 'Recall@
33
  precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113}
34
  ```
35
 
36
- onnx dynamic uint8 model with f32 output:
37
 
38
  ```
39
  ndcg: {'NDCG@1': 0.52333, 'NDCG@3': 0.58087, 'NDCG@5': 0.59811, 'NDCG@10': 0.6249, 'NDCG@100': 0.66025, 'NDCG@1000': 0.67023}
@@ -41,7 +550,9 @@ recall: {'Recall@1': 0.4965, 'Recall@3': 0.62211, 'Recall@5': 0.66622, 'Recall@1
41
  precision: {'P@1': 0.52333, 'P@3': 0.22889, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
42
  ```
43
 
44
- onnx dynamic uint8 model with uint8 output (this model):
45
 
46
  ```
47
  ndcg: {'NDCG@1': 0.52667, 'NDCG@3': 0.58478, 'NDCG@5': 0.60006, 'NDCG@10': 0.62646, 'NDCG@100': 0.66175, 'NDCG@1000': 0.67171}
@@ -49,6 +560,22 @@ recall: {'Recall@1': 0.49983, 'Recall@3': 0.62711, 'Recall@5': 0.66706, 'Recall@
49
  precision: {'P@1': 0.52667, 'P@3': 0.23111, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
50
  ```
51
 
52
  # Example inference/benchmark code and how to use the model with Fastembed
53
 
54
  After installing beir-qdrant make sure to upgrade fastembed.
 
1
+ # Update
2
+
3
+ I've improved the quality of the model, but its size increased from 571MiB to 624MiB.
4
+
5
+ There's now only a ~1% difference in retrieval performance between this model and the full f32 model.
6
+
7
+ This model is ~6% more accurate at retrieval than the onnx-community uint8 model with f32 output.
8
+
9
+ This model is somewhere around 3.5% more accurate at retrieval than the previous version of this model.
10
+
11
+ Inference speed on my hardware (Ryzen CPU) was the same as with the previous model.
12
+
13
  # Qwen3-Embedding-0.6B-onnx-uint8
14
 
15
  This is an onnx version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
 
21
  - Execute model without pooling and without normalization
22
  - Pay attention to the example query format in the code below
23
 
24
+ # Quantization method
25
+
26
+ I created a little onnx model instrumentation framework to assist in quantization. I generated calibration data, created an instrumented onnx model, and recorded the range of values for every tensor in the model during inference. I tested different criteria for excluding nodes until I settled on what I felt was a good size/accuracy tradeoff. I ended up excluding 484 of the most sensitive nodes from quantization.
27
+
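The range-recording step described above might look roughly like the following. This is a minimal sketch, not the author's framework: `RangeTracker` and its methods are hypothetical names, tensors are stood in for by plain lists of floats, and ranking by range width is only one possible exclusion criterion (the README does not say which criteria were tested).

```python
class RangeTracker:
    """Keeps a running (min, max) per tensor name across calibration runs."""

    def __init__(self):
        self.ranges = {}

    def update(self, name, values):
        # Merge the range of this batch into the range seen so far.
        lo, hi = min(values), max(values)
        if name in self.ranges:
            old_lo, old_hi = self.ranges[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.ranges[name] = (lo, hi)

    def widest(self, n):
        """Hypothetical criterion: the n tensor names with the widest range."""
        by_width = sorted(
            self.ranges,
            key=lambda k: self.ranges[k][1] - self.ranges[k][0],
            reverse=True,
        )
        return by_width[:n]


tracker = RangeTracker()
tracker.update("layer_a", [-0.5, 0.25])
tracker.update("layer_b", [-20.0, 35.0])
tracker.update("layer_a", [-0.1, 0.75])  # second calibration batch
```

In a real setup the values would come from the extra outputs of an instrumented ONNX model rather than from hand-written lists.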
28
+ After that I generated 1 million tokens of calibration data and recorded the range of float32 outputs seen during inference.
29
+
30
+ The range I found: -0.3009805381298065 to 0.3952634334564209
31
+
32
+ I used that range for an asymmetric linear quantization from float32 -> uint8.
33
+
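To illustrate what an asymmetric linear mapping over that range could look like, here is a small pure-Python sketch. It assumes the common convention of deriving a scale and zero point so that the range minimum maps to 0 and the maximum to 255; the exact formula baked into the model is not stated in the README, and the names `SCALE`, `ZERO_POINT`, `quantize`, and `dequantize` are illustrative.

```python
# Range reported in the README.
F32_MIN = -0.3009805381298065
F32_MAX = 0.3952634334564209

# Assumed convention: 256 uint8 levels spread evenly over [F32_MIN, F32_MAX].
SCALE = (F32_MAX - F32_MIN) / 255.0
ZERO_POINT = round(-F32_MIN / SCALE)  # uint8 code that represents 0.0


def quantize(values):
    """Map float values to uint8 codes, clamping anything outside the range."""
    codes = []
    for v in values:
        q = round(v / SCALE) + ZERO_POINT
        codes.append(max(0, min(255, q)))
    return codes


def dequantize(codes):
    """Map uint8 codes back to approximate float values."""
    return [(q - ZERO_POINT) * SCALE for q in codes]


codes = quantize([F32_MIN, 0.0, F32_MAX, 1.0])
approx = dequantize(codes)
```

Because the range is not centered on zero, the zero point sits away from the middle of the uint8 range, which is what makes the mapping asymmetric. Values outside the calibrated range (such as `1.0` above) saturate at 0 or 255.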
34
+ <details>
35
+ <summary>Here are the nodes I excluded</summary>
36
+
37
+ ```python
38
+ ["/0/auto_model/ConstantOfShape",
39
+ "/0/auto_model/Constant_28",
40
+ "/0/auto_model/layers.25/post_attention_layernorm/Pow",
41
+ "/0/auto_model/layers.26/input_layernorm/Pow",
42
+ "/0/auto_model/layers.25/input_layernorm/Pow",
43
+ "/0/auto_model/layers.24/post_attention_layernorm/Pow",
44
+ "/0/auto_model/layers.24/input_layernorm/Pow",
45
+ "/0/auto_model/layers.23/post_attention_layernorm/Pow",
46
+ "/0/auto_model/layers.23/input_layernorm/Pow",
47
+ "/0/auto_model/layers.22/post_attention_layernorm/Pow",
48
+ "/0/auto_model/layers.22/input_layernorm/Pow",
49
+ "/0/auto_model/layers.3/input_layernorm/Pow",
50
+ "/0/auto_model/layers.4/input_layernorm/Pow",
51
+ "/0/auto_model/layers.3/post_attention_layernorm/Pow",
52
+ "/0/auto_model/layers.21/post_attention_layernorm/Pow",
53
+ "/0/auto_model/layers.5/input_layernorm/Pow",
54
+ "/0/auto_model/layers.4/post_attention_layernorm/Pow",
55
+ "/0/auto_model/layers.5/post_attention_layernorm/Pow",
56
+ "/0/auto_model/layers.6/input_layernorm/Pow",
57
+ "/0/auto_model/layers.6/post_attention_layernorm/Pow",
58
+ "/0/auto_model/layers.7/input_layernorm/Pow",
59
+ "/0/auto_model/layers.8/input_layernorm/Pow",
60
+ "/0/auto_model/layers.7/post_attention_layernorm/Pow",
61
+ "/0/auto_model/layers.26/post_attention_layernorm/Pow",
62
+ "/0/auto_model/layers.9/input_layernorm/Pow",
63
+ "/0/auto_model/layers.8/post_attention_layernorm/Pow",
64
+ "/0/auto_model/layers.21/input_layernorm/Pow",
65
+ "/0/auto_model/layers.20/post_attention_layernorm/Pow",
66
+ "/0/auto_model/layers.9/post_attention_layernorm/Pow",
67
+ "/0/auto_model/layers.10/input_layernorm/Pow",
68
+ "/0/auto_model/layers.20/input_layernorm/Pow",
69
+ "/0/auto_model/layers.11/input_layernorm/Pow",
70
+ "/0/auto_model/layers.10/post_attention_layernorm/Pow",
71
+ "/0/auto_model/layers.12/input_layernorm/Pow",
72
+ "/0/auto_model/layers.11/post_attention_layernorm/Pow",
73
+ "/0/auto_model/layers.12/post_attention_layernorm/Pow",
74
+ "/0/auto_model/layers.13/input_layernorm/Pow",
75
+ "/0/auto_model/layers.19/post_attention_layernorm/Pow",
76
+ "/0/auto_model/layers.13/post_attention_layernorm/Pow",
77
+ "/0/auto_model/layers.14/input_layernorm/Pow",
78
+ "/0/auto_model/layers.19/input_layernorm/Pow",
79
+ "/0/auto_model/layers.18/post_attention_layernorm/Pow",
80
+ "/0/auto_model/layers.14/post_attention_layernorm/Pow",
81
+ "/0/auto_model/layers.15/input_layernorm/Pow",
82
+ "/0/auto_model/layers.16/input_layernorm/Pow",
83
+ "/0/auto_model/layers.15/post_attention_layernorm/Pow",
84
+ "/0/auto_model/layers.18/input_layernorm/Pow",
85
+ "/0/auto_model/layers.17/post_attention_layernorm/Pow",
86
+ "/0/auto_model/layers.17/input_layernorm/Pow",
87
+ "/0/auto_model/layers.16/post_attention_layernorm/Pow",
88
+ "/0/auto_model/layers.27/post_attention_layernorm/Pow",
89
+ "/0/auto_model/layers.27/input_layernorm/Pow",
90
+ "/0/auto_model/norm/Pow",
91
+ "/0/auto_model/layers.25/post_attention_layernorm/ReduceMean",
92
+ "/0/auto_model/layers.25/post_attention_layernorm/Add",
93
+ "/0/auto_model/layers.26/input_layernorm/Add",
94
+ "/0/auto_model/layers.26/input_layernorm/ReduceMean",
95
+ "/0/auto_model/layers.25/input_layernorm/ReduceMean",
96
+ "/0/auto_model/layers.25/input_layernorm/Add",
97
+ "/0/auto_model/layers.24/post_attention_layernorm/ReduceMean",
98
+ "/0/auto_model/layers.24/post_attention_layernorm/Add",
99
+ "/0/auto_model/layers.24/input_layernorm/Add",
100
+ "/0/auto_model/layers.24/input_layernorm/ReduceMean",
101
+ "/0/auto_model/layers.23/post_attention_layernorm/Add",
102
+ "/0/auto_model/layers.23/post_attention_layernorm/ReduceMean",
103
+ "/0/auto_model/layers.23/input_layernorm/ReduceMean",
104
+ "/0/auto_model/layers.23/input_layernorm/Add",
105
+ "/0/auto_model/layers.22/post_attention_layernorm/ReduceMean",
106
+ "/0/auto_model/layers.22/post_attention_layernorm/Add",
107
+ "/0/auto_model/layers.26/post_attention_layernorm/ReduceMean",
108
+ "/0/auto_model/layers.26/post_attention_layernorm/Add",
109
+ "/0/auto_model/layers.22/input_layernorm/ReduceMean",
110
+ "/0/auto_model/layers.22/input_layernorm/Add",
111
+ "/0/auto_model/layers.3/input_layernorm/Add",
112
+ "/0/auto_model/layers.3/input_layernorm/ReduceMean",
113
+ "/0/auto_model/layers.21/post_attention_layernorm/ReduceMean",
114
+ "/0/auto_model/layers.21/post_attention_layernorm/Add",
115
+ "/0/auto_model/layers.4/input_layernorm/Add",
116
+ "/0/auto_model/layers.4/input_layernorm/ReduceMean",
117
+ "/0/auto_model/layers.3/post_attention_layernorm/Add",
118
+ "/0/auto_model/layers.3/post_attention_layernorm/ReduceMean",
119
+ "/0/auto_model/layers.5/input_layernorm/Add",
120
+ "/0/auto_model/layers.5/input_layernorm/ReduceMean",
121
+ "/0/auto_model/layers.4/post_attention_layernorm/ReduceMean",
122
+ "/0/auto_model/layers.4/post_attention_layernorm/Add",
123
+ "/0/auto_model/layers.5/post_attention_layernorm/Add",
124
+ "/0/auto_model/layers.5/post_attention_layernorm/ReduceMean",
125
+ "/0/auto_model/layers.6/input_layernorm/Add",
126
+ "/0/auto_model/layers.6/input_layernorm/ReduceMean",
127
+ "/0/auto_model/layers.6/post_attention_layernorm/Add",
128
+ "/0/auto_model/layers.6/post_attention_layernorm/ReduceMean",
129
+ "/0/auto_model/layers.7/input_layernorm/Add",
130
+ "/0/auto_model/layers.7/input_layernorm/ReduceMean",
131
+ "/0/auto_model/layers.8/input_layernorm/ReduceMean",
132
+ "/0/auto_model/layers.8/input_layernorm/Add",
133
+ "/0/auto_model/layers.7/post_attention_layernorm/Add",
134
+ "/0/auto_model/layers.7/post_attention_layernorm/ReduceMean",
135
+ "/0/auto_model/layers.9/input_layernorm/Add",
136
+ "/0/auto_model/layers.9/input_layernorm/ReduceMean",
137
+ "/0/auto_model/layers.8/post_attention_layernorm/Add",
138
+ "/0/auto_model/layers.8/post_attention_layernorm/ReduceMean",
139
+ "/0/auto_model/layers.21/input_layernorm/Add",
140
+ "/0/auto_model/layers.21/input_layernorm/ReduceMean",
141
+ "/0/auto_model/layers.20/post_attention_layernorm/Add",
142
+ "/0/auto_model/layers.20/post_attention_layernorm/ReduceMean",
143
+ "/0/auto_model/layers.9/post_attention_layernorm/ReduceMean",
144
+ "/0/auto_model/layers.9/post_attention_layernorm/Add",
145
+ "/0/auto_model/layers.10/input_layernorm/ReduceMean",
146
+ "/0/auto_model/layers.10/input_layernorm/Add",
147
+ "/0/auto_model/layers.20/input_layernorm/Add",
148
+ "/0/auto_model/layers.20/input_layernorm/ReduceMean",
149
+ "/0/auto_model/layers.11/input_layernorm/ReduceMean",
150
+ "/0/auto_model/layers.11/input_layernorm/Add",
151
+ "/0/auto_model/layers.10/post_attention_layernorm/ReduceMean",
152
+ "/0/auto_model/layers.10/post_attention_layernorm/Add",
153
+ "/0/auto_model/layers.12/input_layernorm/ReduceMean",
154
+ "/0/auto_model/layers.12/input_layernorm/Add",
155
+ "/0/auto_model/layers.11/post_attention_layernorm/Add",
156
+ "/0/auto_model/layers.11/post_attention_layernorm/ReduceMean",
157
+ "/0/auto_model/layers.12/post_attention_layernorm/ReduceMean",
158
+ "/0/auto_model/layers.12/post_attention_layernorm/Add",
159
+ "/0/auto_model/layers.13/input_layernorm/Add",
160
+ "/0/auto_model/layers.13/input_layernorm/ReduceMean",
161
+ "/0/auto_model/layers.19/post_attention_layernorm/Add",
162
+ "/0/auto_model/layers.19/post_attention_layernorm/ReduceMean",
163
+ "/0/auto_model/layers.13/post_attention_layernorm/ReduceMean",
164
+ "/0/auto_model/layers.13/post_attention_layernorm/Add",
165
+ "/0/auto_model/layers.14/input_layernorm/Add",
166
+ "/0/auto_model/layers.14/input_layernorm/ReduceMean",
167
+ "/0/auto_model/layers.19/input_layernorm/ReduceMean",
168
+ "/0/auto_model/layers.19/input_layernorm/Add",
169
+ "/0/auto_model/layers.18/post_attention_layernorm/ReduceMean",
170
+ "/0/auto_model/layers.18/post_attention_layernorm/Add",
171
+ "/0/auto_model/layers.14/post_attention_layernorm/ReduceMean",
172
+ "/0/auto_model/layers.14/post_attention_layernorm/Add",
173
+ "/0/auto_model/layers.15/input_layernorm/ReduceMean",
174
+ "/0/auto_model/layers.15/input_layernorm/Add",
175
+ "/0/auto_model/layers.16/input_layernorm/Add",
176
+ "/0/auto_model/layers.16/input_layernorm/ReduceMean",
177
+ "/0/auto_model/layers.15/post_attention_layernorm/Add",
178
+ "/0/auto_model/layers.15/post_attention_layernorm/ReduceMean",
179
+ "/0/auto_model/layers.18/input_layernorm/Add",
180
+ "/0/auto_model/layers.18/input_layernorm/ReduceMean",
181
+ "/0/auto_model/layers.17/post_attention_layernorm/Add",
182
+ "/0/auto_model/layers.17/post_attention_layernorm/ReduceMean",
183
+ "/0/auto_model/layers.17/input_layernorm/ReduceMean",
184
+ "/0/auto_model/layers.17/input_layernorm/Add",
185
+ "/0/auto_model/layers.16/post_attention_layernorm/Add",
186
+ "/0/auto_model/layers.16/post_attention_layernorm/ReduceMean",
187
+ "/0/auto_model/layers.27/post_attention_layernorm/Add",
188
+ "/0/auto_model/layers.27/post_attention_layernorm/ReduceMean",
189
+ "/0/auto_model/layers.27/input_layernorm/Add",
190
+ "/0/auto_model/layers.27/input_layernorm/ReduceMean",
191
+ "/0/auto_model/layers.27/self_attn/q_norm/Pow",
192
+ "/0/auto_model/layers.14/self_attn/k_norm/Pow",
193
+ "/0/auto_model/layers.26/self_attn/q_norm/Pow",
194
+ "/0/auto_model/layers.25/self_attn/q_norm/Pow",
195
+ "/0/auto_model/layers.26/self_attn/k_norm/Pow",
196
+ "/0/auto_model/layers.8/self_attn/k_norm/Pow",
197
+ "/0/auto_model/layers.24/self_attn/k_norm/Pow",
198
+ "/0/auto_model/layers.24/self_attn/q_norm/Pow",
199
+ "/0/auto_model/layers.25/self_attn/k_norm/Pow",
200
+ "/0/auto_model/layers.23/self_attn/q_norm/Pow",
201
+ "/0/auto_model/layers.27/self_attn/k_norm/Pow",
202
+ "/0/auto_model/layers.12/self_attn/k_norm/Pow",
203
+ "/0/auto_model/layers.13/self_attn/k_norm/Pow",
204
+ "/0/auto_model/layers.2/mlp/down_proj/MatMul",
205
+ "/0/auto_model/layers.3/post_attention_layernorm/Cast",
206
+ "/0/auto_model/layers.3/Add",
207
+ "/0/auto_model/layers.3/Add_1",
208
+ "/0/auto_model/layers.4/input_layernorm/Cast",
209
+ "/0/auto_model/layers.3/input_layernorm/Cast",
210
+ "/0/auto_model/layers.2/Add_1",
211
+ "/0/auto_model/layers.4/Add",
212
+ "/0/auto_model/layers.4/post_attention_layernorm/Cast",
213
+ "/0/auto_model/layers.5/input_layernorm/Cast",
214
+ "/0/auto_model/layers.4/Add_1",
215
+ "/0/auto_model/layers.5/post_attention_layernorm/Cast",
216
+ "/0/auto_model/layers.5/Add",
217
+ "/0/auto_model/layers.5/Add_1",
218
+ "/0/auto_model/layers.6/input_layernorm/Cast",
219
+ "/0/auto_model/layers.7/Add_1",
220
+ "/0/auto_model/layers.8/input_layernorm/Cast",
221
+ "/0/auto_model/layers.7/Add",
222
+ "/0/auto_model/layers.7/post_attention_layernorm/Cast",
223
+ "/0/auto_model/layers.6/Add",
224
+ "/0/auto_model/layers.6/post_attention_layernorm/Cast",
225
+ "/0/auto_model/layers.6/Add_1",
226
+ "/0/auto_model/layers.7/input_layernorm/Cast",
227
+ "/0/auto_model/layers.8/Add",
228
+ "/0/auto_model/layers.8/post_attention_layernorm/Cast",
229
+ "/0/auto_model/layers.9/input_layernorm/Cast",
230
+ "/0/auto_model/layers.8/Add_1",
231
+ "/0/auto_model/layers.9/post_attention_layernorm/Cast",
232
+ "/0/auto_model/layers.9/Add",
233
+ "/0/auto_model/layers.9/Add_1",
234
+ "/0/auto_model/layers.10/input_layernorm/Cast",
235
+ "/0/auto_model/layers.11/input_layernorm/Cast",
236
+ "/0/auto_model/layers.10/Add_1",
237
+ "/0/auto_model/layers.10/Add",
238
+ "/0/auto_model/layers.10/post_attention_layernorm/Cast",
239
+ "/0/auto_model/layers.11/Add",
240
+ "/0/auto_model/layers.11/post_attention_layernorm/Cast",
241
+ "/0/auto_model/layers.11/Add_1",
242
+ "/0/auto_model/layers.12/input_layernorm/Cast",
243
+ "/0/auto_model/layers.12/Add",
244
+ "/0/auto_model/layers.12/post_attention_layernorm/Cast",
245
+ "/0/auto_model/layers.12/Add_1",
246
+ "/0/auto_model/layers.13/input_layernorm/Cast",
247
+ "/0/auto_model/layers.13/Add",
248
+ "/0/auto_model/layers.13/post_attention_layernorm/Cast",
249
+ "/0/auto_model/layers.14/input_layernorm/Cast",
250
+ "/0/auto_model/layers.13/Add_1",
251
+ "/0/auto_model/layers.14/Add_1",
252
+ "/0/auto_model/layers.15/input_layernorm/Cast",
253
+ "/0/auto_model/layers.14/post_attention_layernorm/Cast",
254
+ "/0/auto_model/layers.14/Add",
255
+ "/0/auto_model/layers.15/post_attention_layernorm/Cast",
256
+ "/0/auto_model/layers.15/Add_1",
257
+ "/0/auto_model/layers.16/input_layernorm/Cast",
258
+ "/0/auto_model/layers.15/Add",
259
+ "/0/auto_model/layers.17/input_layernorm/Cast",
260
+ "/0/auto_model/layers.16/Add_1",
261
+ "/0/auto_model/layers.16/Add",
262
+ "/0/auto_model/layers.16/post_attention_layernorm/Cast",
263
+ "/0/auto_model/layers.19/input_layernorm/Cast",
264
+ "/0/auto_model/layers.18/Add_1",
265
+ "/0/auto_model/layers.18/input_layernorm/Cast",
266
+ "/0/auto_model/layers.17/Add_1",
267
+ "/0/auto_model/layers.17/Add",
268
+ "/0/auto_model/layers.17/post_attention_layernorm/Cast",
269
+ "/0/auto_model/layers.18/post_attention_layernorm/Cast",
270
+ "/0/auto_model/layers.18/Add",
271
+ "/0/auto_model/layers.19/Add",
272
+ "/0/auto_model/layers.19/post_attention_layernorm/Cast",
273
+ "/0/auto_model/layers.22/Add_1",
274
+ "/0/auto_model/layers.23/input_layernorm/Cast",
275
+ "/0/auto_model/layers.20/Add_1",
276
+ "/0/auto_model/layers.21/input_layernorm/Cast",
277
+ "/0/auto_model/layers.21/Add_1",
278
+ "/0/auto_model/layers.22/input_layernorm/Cast",
279
+ "/0/auto_model/layers.19/Add_1",
280
+ "/0/auto_model/layers.20/input_layernorm/Cast",
281
+ "/0/auto_model/layers.24/input_layernorm/Cast",
282
+ "/0/auto_model/layers.23/Add_1",
283
+ "/0/auto_model/layers.22/Add",
284
+ "/0/auto_model/layers.22/post_attention_layernorm/Cast",
285
+ "/0/auto_model/layers.21/Add",
286
+ "/0/auto_model/layers.21/post_attention_layernorm/Cast",
287
+ "/0/auto_model/layers.20/Add",
288
+ "/0/auto_model/layers.20/post_attention_layernorm/Cast",
289
+ "/0/auto_model/layers.23/post_attention_layernorm/Cast",
290
+ "/0/auto_model/layers.23/Add",
291
+ "/0/auto_model/layers.25/input_layernorm/Cast",
292
+ "/0/auto_model/layers.24/Add_1",
293
+ "/0/auto_model/layers.24/post_attention_layernorm/Cast",
294
+ "/0/auto_model/layers.24/Add",
295
+ "/0/auto_model/layers.25/Add",
296
+ "/0/auto_model/layers.25/post_attention_layernorm/Cast",
297
+ "/0/auto_model/layers.25/Add_1",
298
+ "/0/auto_model/layers.26/input_layernorm/Cast",
299
+ "/0/auto_model/layers.26/Add",
300
+ "/0/auto_model/layers.26/post_attention_layernorm/Cast",
301
+ "/0/auto_model/layers.21/self_attn/q_norm/Pow",
302
+ "/0/auto_model/layers.26/Add_1",
303
+ "/0/auto_model/layers.27/input_layernorm/Cast",
304
+ "/0/auto_model/layers.27/Add",
305
+ "/0/auto_model/layers.27/post_attention_layernorm/Cast",
306
+ "/0/auto_model/norm/Add",
307
+ "/0/auto_model/norm/ReduceMean",
308
+ "/0/auto_model/layers.23/self_attn/k_norm/Pow",
309
+ "/0/auto_model/layers.21/self_attn/k_norm/Pow",
310
+ "/0/auto_model/layers.22/self_attn/k_norm/Pow",
311
+ "/0/auto_model/layers.10/self_attn/k_norm/Pow",
312
+ "/0/auto_model/layers.19/self_attn/q_norm/Pow",
313
+ "/0/auto_model/layers.2/mlp/Mul",
314
+ "/0/auto_model/layers.22/self_attn/q_norm/Pow",
315
+ "/0/auto_model/layers.11/self_attn/k_norm/Pow",
316
+ "/0/auto_model/layers.20/self_attn/q_norm/Pow",
317
+ "/0/auto_model/layers.20/self_attn/k_norm/Pow",
318
+ "/0/auto_model/layers.18/self_attn/q_norm/Pow",
319
+ "/0/auto_model/layers.17/self_attn/q_norm/Pow",
320
+ "/0/auto_model/layers.27/mlp/down_proj/MatMul",
321
+ "/0/auto_model/layers.19/self_attn/k_norm/Pow",
322
+ "/0/auto_model/layers.27/Add_1",
323
+ "/0/auto_model/norm/Cast",
324
+ "/0/auto_model/layers.16/self_attn/k_norm/Pow",
325
+ "/0/auto_model/layers.18/self_attn/k_norm/Pow",
326
+ "/0/auto_model/layers.11/self_attn/q_norm/Pow",
327
+ "/0/auto_model/layers.9/self_attn/q_norm/Pow",
328
+ "/0/auto_model/layers.26/self_attn/q_norm/Add",
329
+ "/0/auto_model/layers.26/self_attn/q_norm/ReduceMean",
330
+ "/0/auto_model/layers.14/self_attn/k_norm/Add",
331
+ "/0/auto_model/layers.14/self_attn/k_norm/ReduceMean",
332
+ "/0/auto_model/layers.16/self_attn/q_norm/Pow",
333
+ "/0/auto_model/layers.27/mlp/Mul",
334
+ "/0/auto_model/layers.27/self_attn/q_norm/ReduceMean",
335
+ "/0/auto_model/layers.27/self_attn/q_norm/Add",
336
+ "/0/auto_model/layers.9/self_attn/k_norm/Pow",
337
+ "/0/auto_model/layers.17/self_attn/k_norm/Pow",
338
+ "/0/auto_model/layers.26/self_attn/k_norm/ReduceMean",
339
+ "/0/auto_model/layers.26/self_attn/k_norm/Add",
340
+ "/0/auto_model/layers.25/self_attn/k_norm/Add",
341
+ "/0/auto_model/layers.25/self_attn/k_norm/ReduceMean",
342
+ "/0/auto_model/layers.13/self_attn/k_norm/Add",
343
+ "/0/auto_model/layers.13/self_attn/k_norm/ReduceMean",
344
+ "/0/auto_model/layers.10/self_attn/q_norm/Pow",
345
+ "/0/auto_model/layers.25/input_layernorm/Mul_1",
346
+ "/0/auto_model/layers.27/self_attn/k_norm/ReduceMean",
347
+ "/0/auto_model/layers.27/self_attn/k_norm/Add",
348
+ "/0/auto_model/layers.26/input_layernorm/Mul_1",
349
+ "/0/auto_model/layers.15/self_attn/q_norm/Pow",
350
+ "/0/auto_model/layers.12/self_attn/k_norm/Add",
351
+ "/0/auto_model/layers.12/self_attn/k_norm/ReduceMean",
352
+ "/0/auto_model/layers.25/self_attn/q_norm/Add",
353
+ "/0/auto_model/layers.25/self_attn/q_norm/ReduceMean",
354
+ "/0/auto_model/layers.24/input_layernorm/Mul_1",
355
+ "/0/auto_model/layers.12/self_attn/q_norm/Pow",
356
+ "/0/auto_model/layers.24/self_attn/q_norm/ReduceMean",
357
+ "/0/auto_model/layers.24/self_attn/q_norm/Add",
358
+ "/0/auto_model/layers.24/self_attn/k_norm/ReduceMean",
359
+ "/0/auto_model/layers.24/self_attn/k_norm/Add",
360
+ "/0/auto_model/layers.22/mlp/Mul",
361
+ "/0/auto_model/layers.2/post_attention_layernorm/Pow",
362
+ "/0/auto_model/layers.23/mlp/Mul",
363
+ "/0/auto_model/layers.24/mlp/Mul",
364
+ "/0/auto_model/layers.23/input_layernorm/Mul_1",
365
+ "/0/auto_model/layers.14/self_attn/q_norm/Pow",
366
+ "/0/auto_model/layers.14/self_attn/k_proj/MatMul",
367
+ "/0/auto_model/layers.14/self_attn/k_norm/Cast",
368
+ "/0/auto_model/layers.14/self_attn/Reshape_1",
369
+ "/0/auto_model/layers.21/mlp/Mul",
370
+ "/0/auto_model/layers.3/post_attention_layernorm/Sqrt",
371
+ "/0/auto_model/layers.3/input_layernorm/Sqrt",
372
+ "/0/auto_model/layers.4/input_layernorm/Sqrt",
373
+ "/0/auto_model/layers.5/input_layernorm/Sqrt",
374
+ "/0/auto_model/layers.4/post_attention_layernorm/Sqrt",
375
+ "/0/auto_model/layers.5/post_attention_layernorm/Sqrt",
376
+ "/0/auto_model/layers.6/input_layernorm/Sqrt",
377
+ "/0/auto_model/layers.6/post_attention_layernorm/Sqrt",
378
+ "/0/auto_model/layers.8/input_layernorm/Sqrt",
379
+ "/0/auto_model/layers.8/post_attention_layernorm/Sqrt",
380
+ "/0/auto_model/layers.7/post_attention_layernorm/Sqrt",
381
+ "/0/auto_model/layers.7/input_layernorm/Sqrt",
382
+ "/0/auto_model/layers.9/input_layernorm/Sqrt",
383
+ "/0/auto_model/layers.10/input_layernorm/Sqrt",
384
+ "/0/auto_model/layers.9/post_attention_layernorm/Sqrt",
385
+ "/0/auto_model/layers.11/input_layernorm/Sqrt",
386
+ "/0/auto_model/layers.10/post_attention_layernorm/Sqrt",
387
+ "/0/auto_model/layers.12/post_attention_layernorm/Sqrt",
388
+ "/0/auto_model/layers.11/post_attention_layernorm/Sqrt",
389
+ "/0/auto_model/layers.12/input_layernorm/Sqrt",
390
+ "/0/auto_model/layers.13/input_layernorm/Sqrt",
391
+ "/0/auto_model/layers.14/input_layernorm/Sqrt",
392
+ "/0/auto_model/layers.13/post_attention_layernorm/Sqrt",
393
+ "/0/auto_model/layers.15/input_layernorm/Sqrt",
394
+ "/0/auto_model/layers.14/post_attention_layernorm/Sqrt",
395
+ "/0/auto_model/layers.16/input_layernorm/Sqrt",
396
+ "/0/auto_model/layers.15/post_attention_layernorm/Sqrt",
397
+ "/0/auto_model/layers.17/input_layernorm/Sqrt",
398
+ "/0/auto_model/layers.16/post_attention_layernorm/Sqrt",
399
+ "/0/auto_model/layers.19/input_layernorm/Sqrt",
400
+ "/0/auto_model/layers.17/post_attention_layernorm/Sqrt",
401
+ "/0/auto_model/layers.18/input_layernorm/Sqrt",
402
+ "/0/auto_model/layers.18/post_attention_layernorm/Sqrt",
403
+ "/0/auto_model/layers.19/post_attention_layernorm/Sqrt",
404
+ "/0/auto_model/layers.23/input_layernorm/Sqrt",
405
+ "/0/auto_model/layers.20/input_layernorm/Sqrt",
406
+ "/0/auto_model/layers.21/input_layernorm/Sqrt",
407
+ "/0/auto_model/layers.22/input_layernorm/Sqrt",
408
+ "/0/auto_model/layers.22/post_attention_layernorm/Sqrt",
409
+ "/0/auto_model/layers.24/input_layernorm/Sqrt",
410
+ "/0/auto_model/layers.20/post_attention_layernorm/Sqrt",
411
+ "/0/auto_model/layers.21/post_attention_layernorm/Sqrt",
412
+ "/0/auto_model/layers.23/post_attention_layernorm/Sqrt",
413
+ "/0/auto_model/layers.25/input_layernorm/Sqrt",
414
+ "/0/auto_model/layers.24/post_attention_layernorm/Sqrt",
415
+ "/0/auto_model/layers.25/post_attention_layernorm/Sqrt",
416
+ "/0/auto_model/layers.26/input_layernorm/Sqrt",
417
+ "/0/auto_model/layers.26/post_attention_layernorm/Sqrt",
418
+ "/0/auto_model/layers.15/self_attn/k_norm/Pow",
419
+ "/0/auto_model/layers.27/input_layernorm/Sqrt",
420
+ "/0/auto_model/layers.27/post_attention_layernorm/Sqrt",
421
+ "/0/auto_model/layers.2/input_layernorm/Pow",
422
+ "/0/auto_model/layers.26/mlp/Mul",
423
+ "/0/auto_model/layers.23/self_attn/q_norm/Add",
424
+ "/0/auto_model/layers.23/self_attn/q_norm/ReduceMean",
425
+ "/0/auto_model/layers.13/self_attn/q_norm/Pow",
426
+ "/0/auto_model/layers.21/self_attn/q_norm/Add",
427
+ "/0/auto_model/layers.21/self_attn/q_norm/ReduceMean",
428
+ "/0/auto_model/layers.6/self_attn/q_norm/Pow",
429
+ "/0/auto_model/layers.27/self_attn/Reshape_7",
430
+ "/0/auto_model/layers.27/self_attn/MatMul_1",
431
+ "/0/auto_model/layers.27/self_attn/Transpose_4",
432
+ "/0/auto_model/layers.26/self_attn/Expand_1",
433
+ "/0/auto_model/layers.26/self_attn/Unsqueeze_19",
434
+ "/0/auto_model/layers.26/self_attn/v_proj/MatMul",
435
+ "/0/auto_model/layers.26/self_attn/Transpose_2",
436
+ "/0/auto_model/layers.26/self_attn/Reshape_6",
437
+ "/0/auto_model/layers.26/self_attn/Reshape_2",
438
+ "/0/auto_model/layers.11/self_attn/k_norm/ReduceMean",
439
+ "/0/auto_model/layers.11/self_attn/k_norm/Add",
440
+ "/0/auto_model/layers.22/input_layernorm/Mul_1",
441
+ "/0/auto_model/layers.25/mlp/Mul",
442
+ "/0/auto_model/layers.8/self_attn/k_norm/Cast",
443
+ "/0/auto_model/layers.8/self_attn/k_proj/MatMul",
444
+ "/0/auto_model/layers.8/self_attn/Reshape_1",
445
+ "/0/auto_model/layers.21/input_layernorm/Mul_1",
446
+ "/0/auto_model/layers.5/self_attn/q_norm/Pow",
447
+ "/0/auto_model/layers.22/self_attn/q_norm/ReduceMean",
448
+ "/0/auto_model/layers.22/self_attn/q_norm/Add",
449
+ "/0/auto_model/layers.22/mlp/down_proj/MatMul",
450
+ "/0/auto_model/layers.23/self_attn/k_norm/ReduceMean",
451
+ "/0/auto_model/layers.23/self_attn/k_norm/Add",
452
+ "/0/auto_model/layers.23/mlp/down_proj/MatMul",
453
+ "/0/auto_model/layers.26/mlp/down_proj/MatMul",
454
+ "/0/auto_model/layers.1/self_attn/Add_2",
455
+ "/0/auto_model/layers.2/self_attn/Add_2",
456
+ "/0/auto_model/layers.6/self_attn/Add_2",
457
+ "/0/auto_model/layers.11/self_attn/Add_2",
458
+ "/0/auto_model/layers.12/self_attn/Add_2",
459
+ "/0/auto_model/layers.16/self_attn/Add_2",
460
+ "/0/auto_model/layers.21/self_attn/Add_2",
461
+ "/0/auto_model/layers.24/self_attn/Add_2",
462
+ "/0/auto_model/layers.0/self_attn/Add_2",
463
+ "/0/auto_model/layers.8/self_attn/Add_2",
464
+ "/0/auto_model/layers.13/self_attn/Add_2",
465
+ "/0/auto_model/layers.26/self_attn/Add_2",
466
+ "/0/auto_model/layers.3/self_attn/Add_2",
467
+ "/0/auto_model/layers.15/self_attn/Add_2",
468
+ "/0/auto_model/layers.25/self_attn/Add_2",
469
+ "/0/auto_model/layers.4/self_attn/Add_2",
470
+ "/0/auto_model/layers.14/self_attn/Add_2",
471
+ "/0/auto_model/layers.22/self_attn/Add_2",
472
+ "/0/auto_model/layers.9/self_attn/Add_2",
473
+ "/0/auto_model/layers.23/self_attn/Add_2",
474
+ "/0/auto_model/layers.10/self_attn/Add_2",
475
+ "/0/auto_model/layers.5/self_attn/Add_2",
476
+ "/0/auto_model/layers.19/self_attn/Add_2",
477
+ "/0/auto_model/layers.7/self_attn/Add_2",
478
+ "/0/auto_model/layers.27/self_attn/Add_2",
479
+ "/0/auto_model/layers.18/self_attn/Add_2",
480
+ "/0/auto_model/layers.20/self_attn/Add_2",
481
+ "/0/auto_model/layers.17/self_attn/Add_2",
482
+ "/0/auto_model/Slice_1",
483
+ "/0/auto_model/layers.5/self_attn/Slice_4",
484
+ "/0/auto_model/layers.12/self_attn/Slice_4",
485
+ "/0/auto_model/layers.18/self_attn/Slice_4",
486
+ "/0/auto_model/layers.3/self_attn/Slice_4",
487
+ "/0/auto_model/layers.11/self_attn/Slice_4",
488
+ "/0/auto_model/layers.22/self_attn/Slice_4",
489
+ "/0/auto_model/Expand",
490
+ "/0/auto_model/layers.4/self_attn/Slice_4",
491
+ "/0/auto_model/Slice_2",
492
+ "/0/auto_model/layers.8/self_attn/Slice_4",
493
+ "/0/auto_model/layers.2/self_attn/Slice_4",
494
+ "/0/auto_model/layers.15/self_attn/Slice_4",
495
+ "/0/auto_model/layers.26/self_attn/Slice_4",
496
+ "/0/auto_model/layers.24/self_attn/Slice_4",
497
+ "/0/auto_model/Expand_1",
498
+ "/0/auto_model/layers.14/self_attn/Slice_4",
499
+ "/0/auto_model/layers.21/self_attn/Slice_4",
500
+ "/0/auto_model/layers.1/self_attn/Slice_4",
501
+ "/0/auto_model/Reshape_2",
502
+ "/0/auto_model/layers.19/self_attn/Slice_4",
503
+ "/0/auto_model/Slice",
504
+ "/0/auto_model/layers.6/self_attn/Slice_4",
505
+ "/0/auto_model/layers.0/self_attn/Slice_4",
506
+ "/0/auto_model/layers.25/self_attn/Slice_4",
507
+ "/0/auto_model/Unsqueeze_4",
508
+ "/0/auto_model/layers.10/self_attn/Slice_4",
509
+ "/0/auto_model/layers.23/self_attn/Slice_4",
510
+ "/0/auto_model/layers.17/self_attn/Slice_4",
511
+ "/0/auto_model/Where_1",
512
+ "/0/auto_model/layers.27/self_attn/Slice_4",
513
+ "/0/auto_model/layers.20/self_attn/Slice_4",
514
+ "/0/auto_model/Add",
515
+ "/0/auto_model/Mul",
516
+ "/0/auto_model/layers.7/self_attn/Slice_4",
517
+ "/0/auto_model/layers.13/self_attn/Slice_4",
518
+ "/0/auto_model/layers.9/self_attn/Slice_4",
519
+ "/0/auto_model/layers.16/self_attn/Slice_4",
520
+ "/0/auto_model/Unsqueeze_3",
521
+ "/0/auto_model/ScatterND"]
522
+ ```
523
+
524
+ </details>
525
+
526
  # Benchmarks
527
 
528
  I used beir-qdrant with the scifact dataset.
 
533
 
534
  If someone wants to sponsor me with an NVIDIA GPU I can have a much faster turnaround time with my model experiments and explore some different quantization strategies.
535
 
 
536
 
537
+ onnx f32 model with f32 output (baseline):
538
 
539
  ```
540
  ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301}
 
542
  precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113}
543
  ```
544
 
545
+ onnx dynamic uint8 model with f32 output (previous model's parent):
546
 
547
  ```
548
  ndcg: {'NDCG@1': 0.52333, 'NDCG@3': 0.58087, 'NDCG@5': 0.59811, 'NDCG@10': 0.6249, 'NDCG@100': 0.66025, 'NDCG@1000': 0.67023}
 
550
  precision: {'P@1': 0.52333, 'P@3': 0.22889, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
551
  ```
552
 
553
+ onnx dynamic uint8 model with uint8 output (previous model):
554
+
555
+ Note: The fact that this model benchmarks better than its parent is actually a bad sign. I used more calibration data in the current version to avoid a repeat.
556
 
557
  ```
558
  ndcg: {'NDCG@1': 0.52667, 'NDCG@3': 0.58478, 'NDCG@5': 0.60006, 'NDCG@10': 0.62646, 'NDCG@100': 0.66175, 'NDCG@1000': 0.67171}
 
560
  precision: {'P@1': 0.52667, 'P@3': 0.23111, 'P@5': 0.15, 'P@10': 0.085, 'P@100': 0.0103, 'P@1000': 0.00111}
561
  ```
562
 
563
+ onnx dynamic uint8 model with f32 output (this model's parent):
564
+
565
+ ```
566
+ ndcg: {'NDCG@1': 0.56, 'NDCG@3': 0.63242, 'NDCG@5': 0.66258, 'NDCG@10': 0.68893, 'NDCG@100': 0.71276, 'NDCG@1000': 0.72}
567
+ recall: {'Recall@1': 0.53094, 'Recall@3': 0.68117, 'Recall@5': 0.75417, 'Recall@10': 0.83256, 'Recall@100': 0.94, 'Recall@1000': 0.99667}
568
+ precision: {'P@1': 0.56, 'P@3': 0.24778, 'P@5': 0.16867, 'P@10': 0.094, 'P@100': 0.0107, 'P@1000': 0.00113}
569
+ ```
570
+
571
+ onnx dynamic uint8 model with uint8 output (this model):
572
+
573
+ ```
574
+ ndcg: {'NDCG@1': 0.56, 'NDCG@3': 0.63119, 'NDCG@5': 0.66314, 'NDCG@10': 0.68867, 'NDCG@100': 0.71236, 'NDCG@1000': 0.7201}
575
+ recall: {'Recall@1': 0.53094, 'Recall@3': 0.67783, 'Recall@5': 0.75583, 'Recall@10': 0.83089, 'Recall@100': 0.93667, 'Recall@1000': 0.99667}
576
+ precision: {'P@1': 0.56, 'P@3': 0.24667, 'P@5': 0.16867, 'P@10': 0.094, 'P@100': 0.01067, 'P@1000': 0.00113}
577
+ ```
578
+
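As a quick way to compare the runs above, the following sketch computes each model's absolute NDCG@10 gap to the f32 baseline. The numbers are copied from the benchmark output; the choice of NDCG@10 as the comparison metric is an assumption made here, since the README does not say which metric its percentage figures are based on.

```python
# NDCG@10 values copied from the benchmark output above.
ndcg_at_10 = {
    "f32 baseline": 0.69999,
    "previous model's parent (f32 output)": 0.6249,
    "previous model (uint8 output)": 0.62646,
    "this model's parent (f32 output)": 0.68893,
    "this model (uint8 output)": 0.68867,
}

baseline = ndcg_at_10["f32 baseline"]

# Absolute gap to the f32 baseline, in NDCG@10 points.
gaps = {name: baseline - score for name, score in ndcg_at_10.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.5f} below baseline")
```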
579
  # Example inference/benchmark code and how to use the model with Fastembed
580
 
581
  After installing beir-qdrant make sure to upgrade fastembed.
dynamic_uint8.onnx CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:058fecc253545b8281064f0709afdd409808fea84eaab1b149596f243cfc3da4
3
- size 599507984
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66b8032f385d841b909ec3712a6996e230fe23e548620ca0b41d6d391469c2b0
3
+ size 654930391