How to increase speed when using CPU
#48
by aki0923 - opened
Is there any code that can deploy a fast embedding service on CPU without hurting performance?
I just went down this rabbit hole with Docker for a few hours and landed on the command below as optimal for running this model on a mini PC with an Intel Core i9-13900HK (45 W mobile CPU). This CPU has 6 P-cores with hyperthreading and 10 slower E-cores without hyperthreading (20 logical cores in total), and with my optimizations I got the average latency down to 30 ms for a minimal embedding request.
Important optimizations for my minimal-latency / low-throughput workload were:

- Use the `cpu-ipex`-tagged container. Much faster on this Intel CPU than the `cpu`-tagged containers.
- Restrict Docker to the P-core logical cores ONLY (`--cpuset-cpus="0-11"`).
- DO use hyperthreading on those P-cores. Hyperthreading improved latency by ~50% vs. without.
- Limit thread counts to equal the 12 logical cores: `-e OMP_NUM_THREADS=12 -e MKL_NUM_THREADS=12`.
- Limiting the input tokens to 1/4 of the max was required to avoid running out of memory on model warmup.
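The `0-11` logical-core range above is specific to the i9-13900HK, so on a different hybrid CPU you'd need to find your own P-core range first. One way to sketch this with `lscpu` (assuming a Linux host with util-linux installed): physical cores that show two logical CPUs are hyperthreaded P-cores, while E-cores show only one.

```shell
# Show the logical CPU -> physical core mapping
lscpu --extended=CPU,CORE

# Count logical CPUs per physical core: a count of 2 means a
# hyperthreaded P-core, a count of 1 means an E-core.
lscpu --extended=CPU,CORE | awk 'NR > 1 { count[$2]++ }
  END { for (c in count) printf "core %s: %d logical CPU(s)\n", c, count[c] }'
```

On the i9-13900HK this reports two logical CPUs each for the six P-cores (logical CPUs 0-11) and one each for the ten E-cores, which is where the `--cpuset-cpus="0-11"` range comes from.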
Here's the command - hope it helps others:
```shell
docker run \
  -d \
  --rm \
  -p 8041:80 \
  -v /home/<user_goes_here>/qwen3-embed-data:/data \
  --cpuset-cpus="0-11" \
  -e OMP_NUM_THREADS=12 \
  -e MKL_NUM_THREADS=12 \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-ipex-latest \
  --model-id Qwen/Qwen3-Embedding-0.6B \
  --dtype bfloat16 \
  --max-batch-tokens 8192 \
  --auto-truncate
```
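Once the container is up, you can sanity-check the service and measure end-to-end latency with `curl`. Text Embeddings Inference serves embeddings at the `/embed` endpoint, and port `8041` matches the `-p 8041:80` mapping above; the request text itself is just an illustrative payload.

```shell
# Send a minimal embedding request and report the status code
# plus total wall-clock time for the round trip.
curl -s -o /dev/null \
  -w 'HTTP %{http_code} in %{time_total}s\n' \
  -X POST http://localhost:8041/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?"}'
```

Running this a few times in a row (so the first-request overhead is excluded) is how I'd check the ~30 ms average latency figure on your own hardware.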