Qwen3-Embedding-4B - ONNX FP16

ONNX Runtime export of Qwen/Qwen3-Embedding-4B at fp16 (half-precision float), intended for GPU / Metal inference (fp16 accelerated).

Qwen3-Embedding is a decoder-style embedding model: it runs a Qwen3 causal LM over the input, then pools by taking the last token's hidden state of the final layer (i.e. the hidden state at position attention_mask.sum() - 1). The output embedding dimension is 2560.
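
The pooling step described above can be sketched in NumPy on toy data. This is a minimal illustration, not the exporter's or the base model's own code: the helper name last_token_pool and the toy shapes are assumptions, and the index is computed from the position of the last 1 in each mask row so that it also works if the batch happens to be left-padded. For right-padded batches it equals attention_mask.sum() - 1.

```python
import numpy as np

def last_token_pool(last_hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token per row, then L2-normalize."""
    batch, seq_len = attention_mask.shape
    # Index of the last 1 in each mask row (works for left- or right-padded batches).
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    # Upcast to float32 before normalizing to avoid fp16 rounding in the norm.
    pooled = last_hidden[np.arange(batch), last_idx].astype(np.float32)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

# Toy batch: 2 sequences, 4 positions, hidden size 3 (the real model uses 2560).
hidden = np.arange(24, dtype=np.float16).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # right-padded: last real token at position 2
                 [1, 1, 1, 1]])  # no padding: last real token at position 3
emb = last_token_pool(hidden, mask)
print(emb.shape)  # (2, 3)
```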

Variant details

  • Precision: fp16 (half-precision float)
  • ONNX opset: 18 (optimum default at export time)
  • Export task: feature-extraction (--library-name transformers)
  • Inputs: input_ids, attention_mask, position_ids (all int64, dynamic batch_size / sequence_length)
  • Output: last_hidden_state (float16, shape [batch, seq, 2560]); apply last-token pooling + L2 norm yourself
  • Sequence length at export: 512 (dynamic axis, so longer inputs still work; 512 keeps tracing memory manageable)
  • On-disk size: 8.2 GB (ONNX graph + external data; tokenizer/config not counted)
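
Because the graph returns the full unpooled tensor, its size grows with both batch and sequence length. The arithmetic below is a rough sketch derived only from the shape and dtype listed above; the helper name output_bytes is hypothetical, and the figure covers the output tensor alone, not weights or intermediate activations.

```python
def output_bytes(batch: int, seq_len: int, hidden: int = 2560, bytes_per_value: int = 2) -> int:
    """Size in bytes of last_hidden_state: batch * seq_len * hidden fp16 values."""
    return batch * seq_len * hidden * bytes_per_value

print(output_bytes(1, 512))  # 2621440 bytes (2.5 MiB)
print(output_bytes(8, 512))  # 20971520 bytes (20 MiB)
```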

How it was produced

Exported from the HF PyTorch checkpoint with optimum-cli export onnx --dtype fp16. The weights and matmuls are cast to fp16 during the ONNX trace.

Full pipeline:

pip install "optimum[exporters,onnxruntime]"

# FP32 ONNX (intermediate)
optimum-cli export onnx \
  --model Qwen/Qwen3-Embedding-4B \
  --task feature-extraction \
  --library-name transformers \
  --sequence_length 512 --batch_size 1 \
  out/fp32

For FP16, we additionally pass --dtype fp16 --device cpu to the export command.

Quickstart

from pathlib import Path

import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

repo_id = "majentik/Qwen3-Embedding-4B-ONNX-FP16"
local = snapshot_download(repo_id=repo_id)

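# Right padding is set explicitly: the position_ids and pooling index below assume it.
tok = AutoTokenizer.from_pretrained(local, padding_side="right")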
sess = ort.InferenceSession(
    str(Path(local) / "model.onnx"),
    providers=["CPUExecutionProvider"],  # or CUDAExecutionProvider / CoreMLExecutionProvider
)
in_names = {i.name for i in sess.get_inputs()}

def embed(texts: list[str]) -> np.ndarray:
    enc = tok(texts, return_tensors="np", padding=True, truncation=True, max_length=512)
    feeds = {k: v for k, v in enc.items() if k in in_names}
    if "position_ids" in in_names:
        seq = enc["input_ids"].shape[1]
        feeds["position_ids"] = np.broadcast_to(
            np.arange(seq, dtype=np.int64)[None, :], enc["input_ids"].shape,
        ).copy()
    last_hidden = sess.run(None, feeds)[0]
    # last-token pooling (Qwen3-Embedding official recipe)
    lengths = enc["attention_mask"].sum(axis=1) - 1
    pooled = last_hidden[np.arange(len(lengths)), lengths].astype(np.float32)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

vecs = embed([
    "The capital of France is Paris.",
    "La capitale de la France est Paris.",
])
print(vecs.shape)  # (2, 2560)
print("cosine:", float(vecs[0] @ vecs[1]))
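
Since the embeddings are L2-normalized, cosine similarity reduces to a dot product, so ranking documents against a query is a single matrix-vector product. The sketch below uses small stand-in vectors rather than real model output, and rank_by_cosine is a hypothetical helper name, not part of any library.

```python
import numpy as np

def rank_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray) -> list[tuple[int, float]]:
    """Rank unit-norm document vectors by cosine similarity to a unit-norm query."""
    scores = doc_vecs @ query_vec   # dot product equals cosine for unit vectors
    order = np.argsort(-scores)     # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

# Stand-in unit vectors; in practice these would be rows of normalized embeddings.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query = np.array([0.6, 0.8], dtype=np.float32)
ranking = rank_by_cosine(query, docs)
print([i for i, _ in ranking])  # [2, 1, 0]
```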

With Optimum ORTModelForFeatureExtraction

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

repo_id = "majentik/Qwen3-Embedding-4B-ONNX-FP16"
tok = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(repo_id, file_name="model.onnx")

Caveats

  • last_hidden_state is not pooled or L2-normalized; do it yourself (see the quickstart).
  • The export was traced at sequence_length=512. The axes are dynamic so longer inputs still work, but very long contexts will use more memory than the PyTorch path because the feature-extraction ONNX graph does not reuse past key/values.
  • INT8 variants store quantized weights only; activations are quantized dynamically at runtime. Expect a small quality drop vs fp32 with INT8; if that matters, use this FP16 variant.

License

Apache-2.0, inherited from the base model.
