Qwen3-Embedding-4B - ONNX FP16
ONNX Runtime export of Qwen/Qwen3-Embedding-4B in fp16 (half-precision float), intended for GPU or Metal inference, where fp16 is hardware-accelerated.
Qwen3-Embedding is a decoder-style embedding model: it runs a Qwen3 causal LM over the input, then pools by taking the last token's hidden state of the final layer (i.e. the hidden state at position attention_mask.sum() - 1). The output embedding dimension is 2560.
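The pooling rule described above can be sketched in NumPy on synthetic arrays. This is an illustration, not the exported graph itself: the helper names `last_token_pool` and `l2_normalize` are hypothetical, and the left-padding branch is an added assumption, since the index `attention_mask.sum() - 1` only points at the last real token when padding is on the right.

```python
import numpy as np

def last_token_pool(last_hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token of each sequence.

    last_hidden: [batch, seq, dim]; attention_mask: [batch, seq] of 0/1.
    """
    # If every row ends in a real token, the batch is left-padded (or unpadded),
    # so the final position already holds the last real token.
    if attention_mask[:, -1].all():
        return last_hidden[:, -1]
    # Right padding: the last real token sits at index sum(mask) - 1.
    last_idx = attention_mask.sum(axis=1) - 1
    return last_hidden[np.arange(last_hidden.shape[0]), last_idx]

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, computing in float32."""
    x = x.astype(np.float32)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic example: batch of 2, seq of 4, dim of 3 (the real model uses dim 2560).
hidden = np.arange(2 * 4 * 3, dtype=np.float16).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens, right-padded
                 [1, 1, 0, 0]])  # 2 real tokens, right-padded
pooled = last_token_pool(hidden, mask)
unit = l2_normalize(pooled)
```

Here `pooled` holds the rows at positions 2 and 1 respectively, and every row of `unit` has length 1.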
Variant details
- Precision: fp16 (half-precision float)
- ONNX opset: 18 (optimum default at export time)
- Export task: feature-extraction (--library-name transformers)
- Inputs: input_ids, attention_mask, position_ids (all int64, dynamic batch_size / sequence_length)
- Output: last_hidden_state (float16, shape [batch, seq, 2560]); apply last-token pooling and L2 normalization yourself
- Sequence length at export: 512 (dynamic axis, so longer inputs still work; 512 keeps tracing memory manageable)
- On-disk size: 8.2 GB (ONNX graph + external data; tokenizer/config not counted)
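Because the graph returns the full last_hidden_state rather than a pooled vector, the output tensor grows with both batch size and sequence length. A rough size estimate follows, using only the shape and dtype listed above; the helper name `last_hidden_state_bytes` is hypothetical.

```python
HIDDEN_DIM = 2560   # embedding dimension stated above
FP16_BYTES = 2      # bytes per float16 element

def last_hidden_state_bytes(batch: int, seq: int) -> int:
    """Bytes occupied by a float16 [batch, seq, 2560] output tensor."""
    return batch * seq * HIDDEN_DIM * FP16_BYTES

# One sequence at the export length of 512 tokens.
single = last_hidden_state_bytes(1, 512)
# A batch of 32 such sequences.
batch32 = last_hidden_state_bytes(32, 512)
print(f"{single / 2**20:.1f} MiB, {batch32 / 2**20:.1f} MiB")  # 2.5 MiB, 80.0 MiB
```

This counts only the returned tensor, not the weights or intermediate activations.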
How it was produced
Exported from the HF PyTorch checkpoint with optimum-cli export onnx --dtype fp16. The weights and matmuls are cast to fp16 during the ONNX trace.
Full pipeline:
pip install "optimum[exporters,onnxruntime]"
# FP32 ONNX (intermediate)
optimum-cli export onnx \
  --model Qwen/Qwen3-Embedding-4B \
  --task feature-extraction \
  --library-name transformers \
  --sequence_length 512 --batch_size 1 \
  out/fp32
For FP16, we additionally pass --dtype fp16 --device cpu to the export command.
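Casting to fp16 rounds each value to roughly three significant decimal digits and caps finite magnitudes at 65504. The effect can be sketched with NumPy's float16 on made-up values. This illustrates the number format only, not the exporter's internals, and the random "weights" are stand-ins rather than checkpoint data.

```python
import numpy as np

fp16 = np.finfo(np.float16)

# Made-up "weights" in float32, standing in for checkpoint values.
rng = np.random.default_rng(0)
w32 = rng.uniform(0.01, 1.0, size=1000).astype(np.float32)

# Round-trip through float16, which is what a cast to fp16 does to each value.
w16 = w32.astype(np.float16)
rel_err = np.abs(w16.astype(np.float32) - w32) / np.abs(w32)

print("largest finite fp16:", float(fp16.max))
print("max relative rounding error:", float(rel_err.max()))
```

For values in the normal fp16 range, the relative rounding error stays at or below 2**-11, which is why pooled vectors are commonly upcast to float32 before normalization.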
Quickstart
from pathlib import Path
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
repo_id = "majentik/Qwen3-Embedding-4B-ONNX-FP16"
local = snapshot_download(repo_id=repo_id)
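# Right padding, so that attention_mask.sum() - 1 indexes the last real token.
tok = AutoTokenizer.from_pretrained(local, padding_side="right")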
sess = ort.InferenceSession(
    str(Path(local) / "model.onnx"),
    providers=["CPUExecutionProvider"],  # or CUDAExecutionProvider / CoreMLExecutionProvider
)
in_names = {i.name for i in sess.get_inputs()}
def embed(texts: list[str]) -> np.ndarray:
    enc = tok(texts, return_tensors="np", padding=True, truncation=True, max_length=512)
    feeds = {k: v for k, v in enc.items() if k in in_names}
    if "position_ids" in in_names:
        seq = enc["input_ids"].shape[1]
        feeds["position_ids"] = np.broadcast_to(
            np.arange(seq, dtype=np.int64)[None, :], enc["input_ids"].shape,
        ).copy()
    last_hidden = sess.run(None, feeds)[0]
    # last-token pooling (Qwen3-Embedding official recipe)
    lengths = enc["attention_mask"].sum(axis=1) - 1
    pooled = last_hidden[np.arange(len(lengths)), lengths].astype(np.float32)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
vecs = embed([
    "The capital of France is Paris.",
    "La capitale de la France est Paris.",
])
print(vecs.shape) # (2, 2560)
print("cosine:", float(vecs[0] @ vecs[1]))
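Because the returned vectors are unit length, the dot product printed above is the cosine similarity, and ranking a set of documents against a query reduces to one matrix product. A minimal sketch on random stand-in vectors follows; `top_k` is a hypothetical helper and not part of this repo.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (doc_index, cosine) pairs for the k most similar documents.

    Assumes query_vec [dim] and doc_vecs [n_docs, dim] are already L2-normalized.
    """
    scores = doc_vecs @ query_vec        # cosine similarity per document
    order = np.argsort(-scores)[:k]      # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Random unit vectors standing in for embed(...) output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 2560)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=-1, keepdims=True)
query = docs[2]  # a query identical to document 2 should rank it first

hits = top_k(query, docs, k=2)
```

In real use, `docs` and `query` would come from the `embed` function in the quickstart.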
With Optimum ORTModelForFeatureExtraction
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
repo_id = "majentik/Qwen3-Embedding-4B-ONNX-FP16"
tok = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(repo_id, file_name="model.onnx")
Caveats
- last_hidden_state is not pooled or L2-normalized; do it yourself (see the quickstart).
- The export was traced at sequence_length=512. The axes are dynamic so longer inputs still work, but very long contexts will use more memory than the PyTorch path because the feature-extraction ONNX graph does not reuse past key/values.
- INT8 variants are weight-quantized only (dynamic activation quantization at runtime). Expect a small quality drop vs fp32; if that matters, use the FP16 variant.
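One way to keep peak memory bounded on mixed-length inputs is to embed in small batches grouped by length, so that short texts are not padded up to the longest one. The sketch below assumes an embed-like callable that maps a list of strings to an [n, dim] array. `embed_in_batches` and the character-length sort key are illustrative choices, not part of this repo.

```python
from typing import Callable
import numpy as np

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], np.ndarray],
    batch_size: int = 8,
) -> np.ndarray:
    """Embed texts in length-sorted batches and return rows in the original order."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Sort indices by character length as a cheap proxy for token count.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    chunks = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        chunks.append(embed_fn([texts[i] for i in idx]))
    stacked = np.concatenate(chunks, axis=0)
    # Undo the sort so that row i corresponds to texts[i].
    out = np.empty_like(stacked)
    out[np.asarray(order)] = stacked
    return out

# Stand-in for the real embed(): encodes each text's length so order is checkable.
def fake_embed(batch: list[str]) -> np.ndarray:
    return np.array([[float(len(t))] for t in batch], dtype=np.float32)

result = embed_in_batches(["ccc", "a", "bb", "dddd"], fake_embed, batch_size=2)
```

Passing the quickstart's `embed` as `embed_fn` would apply the same grouping to real inputs; a suitable `batch_size` depends on the available memory.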
License
Apache-2.0, inherited from the base model.
See also
- Base model: Qwen/Qwen3-Embedding-4B
- MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
- Garden hub: majentik/garden
- Sibling ONNX variants: