How to increase speed when using CPU
#48
by aki0923 - opened
Is there any code that can deploy a fast embedding service on CPU without hurting performance?
I just went down this rabbit hole with Docker for a few hours and landed on the command below as optimal for running this model on a mini PC with an Intel Core i9-13900HK (45 W mobile CPU). This CPU has 6 P-cores with hyperthreading and 10 slower E-cores without hyperthreading (20 logical cores in total), and with my optimizations I got the average latency down to 30 ms for a minimal embedding request.
Important optimizations for my minimal-latency / low-throughput workload were:

- Use the `cpu-ipex`-tagged container. Much faster on this Intel CPU than the `cpu`-tagged containers.
- Restrict Docker to the P-core logical cores ONLY (`--cpuset-cpus="0-11"`).
- DO use hyperthreading on those P-cores. Hyperthreading improved latency by ~50% vs. without.
- Limit thread counts to equal the 12 logical cores: `-e OMP_NUM_THREADS=12 -e MKL_NUM_THREADS=12`.
- Limiting the input tokens to 1/4 of the max was required to avoid running out of memory on model warmup.
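The `0-11` logical-core range above is specific to the i9-13900HK, so on a different hybrid CPU you'd need to find your own P-core range first. One way to sketch this with `lscpu` (assuming a Linux host with util-linux installed): physical cores that show two logical CPUs are hyperthreaded P-cores, while E-cores show only one.

```shell
# Show the logical CPU -> physical core mapping
lscpu --extended=CPU,CORE

# Count logical CPUs per physical core: a count of 2 means a
# hyperthreaded P-core, a count of 1 means an E-core.
lscpu --extended=CPU,CORE | awk 'NR > 1 { count[$2]++ }
  END { for (c in count) printf "core %s: %d logical CPU(s)\n", c, count[c] }'
```

On the i9-13900HK this reports two logical CPUs each for the six P-cores (logical CPUs 0-11) and one each for the ten E-cores, which is where the `--cpuset-cpus="0-11"` range comes from.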
Here's the command - hope it helps others:
```shell
docker run \
  -d \
  --rm \
  -p 8041:80 \
  -v /home/<user_goes_here>/qwen3-embed-data:/data \
  --cpuset-cpus="0-11" \
  -e OMP_NUM_THREADS=12 \
  -e MKL_NUM_THREADS=12 \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-ipex-latest \
  --model-id Qwen/Qwen3-Embedding-0.6B \
  --dtype bfloat16 \
  --max-batch-tokens 8192 \
  --auto-truncate
```
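Once the container is up, you can sanity-check the service and measure end-to-end latency with `curl`. Text Embeddings Inference serves embeddings at the `/embed` endpoint, and port `8041` matches the `-p 8041:80` mapping above; the request text itself is just an illustrative payload.

```shell
# Send a minimal embedding request and report the status code
# plus total wall-clock time for the round trip.
curl -s -o /dev/null \
  -w 'HTTP %{http_code} in %{time_total}s\n' \
  -X POST http://localhost:8041/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?"}'
```

Running this a few times in a row (so the first-request overhead is excluded) is how I'd check the ~30 ms average latency figure on your own hardware.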