How to increase speed when using CPU

#48
by aki0923 - opened

Is there any code that can deploy a fast embedding service on a CPU without hurting performance?

I just went down this rabbit hole with Docker for a few hours and landed on the command below as the optimal way to run this model on a mini PC with an Intel Core i9-13900HK (45 W mobile CPU). This CPU has 6 P-cores with hyper-threading and 10 slower E-cores without hyper-threading (20 logical cores in total). With these optimizations I got the average latency down to 30 ms for a minimal embedding request.

Important optimizations for my minimal-latency / low-throughput workload were:

  1. Use the cpu-ipex* tagged container. It is much faster on this Intel CPU than the cpu-tagged containers.
  2. Restrict Docker to the P-core logical cores ONLY (--cpuset-cpus="0-11").
  3. DO use hyper-threading on those P cores. Hyper-threading improved latency by ~50% versus without.
  4. Limit thread counts to match the 12 logical cores: -e OMP_NUM_THREADS=12 -e MKL_NUM_THREADS=12
  5. Limiting the input tokens to 1/4 of the maximum was required to avoid running out of memory during model warmup.
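To adapt the --cpuset-cpus pinning in step 2 to a different hybrid CPU, you can find the P-core logical IDs by their higher max frequency. A minimal sketch, assuming `lscpu -e` output (standard util-linux); the helper name `pcores` and the frequency values in the example are illustrative:

```shell
# P-cores report a higher MAXMHZ than E-cores, so list logical CPUs with
# their max frequency and keep the fastest ones:
#   lscpu -e=CPU,MAXMHZ
#
# This helper reads that two-column table on stdin and prints the CPU ids
# whose max frequency equals the highest value seen, i.e. the logical
# cores to pass to --cpuset-cpus:
pcores() {
  awk 'NR>1 {id[NR]=$1; mhz[NR]=$2+0; if (mhz[NR]>max) max=mhz[NR]}
       END {for (i=2; i<=NR; i++) if (mhz[i]==max) print id[i]}'
}

# Usage: lscpu -e=CPU,MAXMHZ | pcores
```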

Here's the command - hope it helps others 😀 :

```shell
docker run \
  -d \
  --rm \
  -p 8041:80 \
  -v /home/<user_goes_here>/qwen3-embed-data:/data \
  --cpuset-cpus="0-11" \
  -e OMP_NUM_THREADS=12 \
  -e MKL_NUM_THREADS=12 \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-ipex-latest \
  --model-id Qwen/Qwen3-Embedding-0.6B \
  --dtype bfloat16 \
  --max-batch-tokens 8192 \
  --auto-truncate
```
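Once the container is up, a quick smoke test against TEI's /embed endpoint doubles as a latency check via curl's built-in timing. This is a usage fragment that needs the running container; port 8041 matches the mapping above, and the input text is arbitrary:

```shell
# Send one embedding request and print the total request time.
# Drop -o /dev/null if you want to see the returned vector.
curl -s -w 'total: %{time_total}s\n' \
  http://localhost:8041/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is the capital of France?"}' \
  -o /dev/null
```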
