ONNX model output
The ONNX output differs from the output of Transformers and SentenceTransformers.
I inspected the ONNX model using netron.app; its outputs `text_embeds` and `13049` correspond to the `sequence_output` and `pooled_output` of the XLMRobertaModel. However, when I compared them with the outputs of the model loaded from model.safetensors, the results were different.
Hi @Riddler2024 , have you tried running inference the way we demonstrate in the README? If there's a slight difference, it could be because ONNX uses fp32, while ST or HF may use bf16 when running in a GPU environment. If this doesn't resolve the issue, please share a code snippet so I can reproduce the behavior.
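To rule out precision as the cause, a quick tolerance check might help (a minimal sketch, assuming `onnx_out` and `hf_out` are the two embedding arrays being compared, as NumPy floats):

```python
import numpy as np

# Cast both sides to fp32 so dtype itself isn't part of the diff.
onnx_out = onnx_out.astype(np.float32)
hf_out = hf_out.astype(np.float32)

print(np.max(np.abs(onnx_out - hf_out)))         # magnitude of the gap
print(np.allclose(onnx_out, hf_out, atol=1e-3))  # True if it's only precision noise
```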
I am using the example code from the README. I extracted the ONNX model output `13049` and compared it with the output of `XLMRobertaModel.forward` in the xlm-roberta-flash-implementation repository, and none of the outputs were normalized.
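For reference, this is roughly how I extract the raw output (a sketch; the model paths and the input names `input_ids`/`attention_mask` are assumptions based on typical XLM-RoBERTa ONNX exports):

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Paths below are placeholders for the actual model files.
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
session = ort.InferenceSession("path/to/model.onnx")

inputs = tokenizer("example sentence", return_tensors="np")
# Request the pooled output by the graph name shown in netron.app ("13049").
(pooled,) = session.run(
    ["13049"],
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)
print(pooled.shape)
```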
@Riddler2024, ok, I see the issue. The ONNX model mimics the `forward` function, which doesn't apply any normalization by itself; however, both HF `encode` and SentenceTransformers include a normalization step. This is why the outputs differ. I'd suggest applying the normalization yourself after running inference. Would that be convenient for your application?
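For example, a minimal sketch of the post-hoc normalization (assuming `pooled` is the raw (batch, dim) array taken from the ONNX model's output):

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    # Divide each embedding by its L2 norm, matching the normalization
    # step applied by HF encode / SentenceTransformers.
    return embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)

normalized = l2_normalize(pooled)
```

After this step, the normalized vectors should match the SentenceTransformers output up to precision differences.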