Language Model Inference on Qualcomm NPU with ONNX Runtime

Pre-requisite

  • A Windows laptop (or access to X86_64 Python on Windows)
  • Qualcomm ID (to login to Qualcomm Device Cloud (QDC))
  • Install the onnxruntime-qnn and onnx Python packages with X86_64 Python: pip install onnxruntime-qnn onnx

Running Inference for Qwen 2.5 0.5B on NPU

Step 1: Model Preparation

Download the ONNX model for Qwen 2.5 0.5B from Hugging Face. Please select model.onnx, since pre-quantized models may not be compatible with QNN. If you wish to try other models, you can always find them through the ONNX Model Explorer.

Then, we need to prepare the FP32 model for quantization. Run the following command:

python -m onnxruntime.quantization.preprocess --input model.onnx --output qwen-pre.onnx --skip_optimization true --auto_merge

This pre-processing tool performs symbolic shape inference, optimization, and shape inference. For Qwen 2.5 0.5B, we use --skip_optimization because the tool cannot optimize models of 2 GB or larger, and --auto_merge merges conflicting symbolic dimensions during shape inference. For other models, you may want to turn optimization on.

If you plan to validate decoding accuracy on CPU later (Step 5), you can jump ahead to Step 2 now to generate a QDQ model without fixed static shapes (qwen-nofixed.qdq.onnx), then come back and continue here.

Next, we need to fix the shapes of all model inputs, since the QNN execution provider does not yet support dynamic shapes. A useful tool for inspecting ONNX models is Netron. Open Netron in your web browser, load qwen-pre.onnx, and click the circle-dot button at the bottom left to open the graph properties. There you can inspect the shape of every input.

In our case, we need to set batch_size to 1, sequence_length to 1, and past_sequence_length to 1. We also need to fix attention_mask to shape [1, 2] so it is consistent with past_sequence_length + sequence_length. You can choose larger values if you want to test decoding with a longer KV cache.

python -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch_size --dim_value 1 qwen-pre.onnx qwen-fixed.onnx
python -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param sequence_length --dim_value 1 qwen-fixed.onnx qwen-fixed.onnx
python -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param past_sequence_length --dim_value 1 qwen-fixed.onnx qwen-fixed.onnx
python -m onnxruntime.tools.make_dynamic_shape_fixed --input_name attention_mask --input_shape 1,2 qwen-fixed.onnx qwen-fixed.onnx

After processing, the graph properties in Netron will show a fixed (static) shape for every input.
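
If you prefer to double-check without Netron, you can list the input shapes with the onnx Python package. This is a minimal sketch; after the commands above, every dimension should print as an integer rather than a symbolic name.

import onnx

# Skip external weight files; we only need the graph inputs.
model = onnx.load("qwen-fixed.onnx", load_external_data=False)
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)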

Step 2: Quantize Model to Fit QNN

Run python quantize_qwen.py to generate the QDQ model. The script performs three steps (a sketch of this flow follows the list):

  1. Create a data loader for the calibration dataset. Here I just use random data for illustration. You can use your own calibration set, but it must match the fixed input shapes from Step 1.
  2. Preprocess the model for the QNN execution provider. The pre-processing in Step 1 targets the CPU EP; this pass targets the QNN EP and applies additional optimizations.
  3. Quantize the model. Here we set weights to unsigned INT8 and activations to unsigned INT16. You can change the precision to match your own quantization scheme, but it cannot be FP32, since the Qualcomm NPU does not support it.
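
For reference, here is a minimal sketch of what such a script can look like using ONNX Runtime's QNN quantization utilities. The real quantize_qwen.py is in this repository; the calibration data below is random, and the intermediate file name qwen-qnn-pre.onnx is purely illustrative.

import numpy as np
import onnx
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import (
    get_qnn_qdq_config,
    qnn_preprocess_model,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random samples whose shapes match the fixed graph inputs."""
    def __init__(self, model_path, num_samples=8):
        model = onnx.load(model_path, load_external_data=False)
        feeds = []
        for _ in range(num_samples):
            feed = {}
            for inp in model.graph.input:
                shape = [d.dim_value for d in inp.type.tensor_type.shape.dim]
                if inp.type.tensor_type.elem_type == onnx.TensorProto.INT64:
                    feed[inp.name] = np.ones(shape, dtype=np.int64)       # ids / mask / positions
                else:
                    feed[inp.name] = np.random.rand(*shape).astype(np.float32)  # past key/values
            feeds.append(feed)
        self.iterator = iter(feeds)

    def get_next(self):
        return next(self.iterator, None)

# Step 2: graph transformations required by the QNN EP.
qnn_preprocess_model("qwen-fixed.onnx", "qwen-qnn-pre.onnx")

# Step 3: quantize with unsigned INT8 weights and unsigned INT16 activations.
qdq_config = get_qnn_qdq_config(
    "qwen-qnn-pre.onnx",
    RandomCalibrationReader("qwen-qnn-pre.onnx"),
    activation_type=QuantType.QUInt16,
    weight_type=QuantType.QUInt8,
)
quantize("qwen-qnn-pre.onnx", "qwen.qdq.onnx", qdq_config)

get_qnn_qdq_config returns a static quantization configuration tuned for the QNN EP, so trying a different precision is mostly a matter of changing activation_type and weight_type.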

You will get a file named qwen.qdq.onnx. For convenience, I have also uploaded a pre-generated copy in case you want to use it directly.

For validation, you can run the model on CPU with python run.py. Please run it to make sure your model works and produces correct values.
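
If you want a quick stand-alone sanity check in addition to the repository's run.py, a minimal sketch that runs one decode step with dummy inputs on the CPU EP looks like this:

import numpy as np
import onnxruntime as ort

# Load the QDQ model on the CPU execution provider and run a single step.
sess = ort.InferenceSession("qwen.qdq.onnx", providers=["CPUExecutionProvider"])
feed = {}
for inp in sess.get_inputs():
    if "int64" in inp.type:
        feed[inp.name] = np.ones(inp.shape, dtype=np.int64)     # ids / mask / positions
    else:
        feed[inp.name] = np.zeros(inp.shape, dtype=np.float32)  # past key/values
logits = sess.run(None, feed)[0]
print(logits.shape)  # logits for one decode step, e.g. (1, 1, vocab_size)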

Step 3: On-device Execution

Log in to QDC and first request 1000 minutes on a Snapdragon X Elite laptop. Create a new folder containing qwen.qdq.onnx and run_qnn.py and zip it. Then, create an interactive session for the X Elite laptop and upload the zipped package (the session creation form has a field for uploading software; use it). You will see a remote Windows desktop.

Open the browser and install Python for Windows using the ARM64 installer. Do not install the X86_64 version. After that, install ONNX Runtime in PowerShell:

pip install onnxruntime-qnn onnx

Change your working directory to C:\Temp\file\your_zip_package_name, which is where your uploaded files are located. Open Task Manager (to watch NPU utilization) and run the script:

python run_qnn.py

It will first process the ONNX graph and then start execution. You should see NPU utilization climb to around 96% in Task Manager.
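
The core of run_qnn.py is selecting the QNN execution provider with the HTP (NPU) backend; a minimal sketch of that session setup, following the QNN EP documentation linked below, is:

import onnxruntime as ort

# QnnHtp.dll is the Hexagon Tensor Processor (NPU) backend shipped with onnxruntime-qnn.
sess = ort.InferenceSession(
    "qwen.qdq.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)

The QNN EP exposes further provider options (for example htp_performance_mode) that are described in the reference documentation at the end of this page.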

Step 4: Profiling on Qualcomm AI Hub

Qualcomm AI Hub is a tool for profiling models on Qualcomm devices. Its Python API only works in an X86 environment, so remember to switch back to X86_64 Python.

First, install Qualcomm AI Hub library:

python -m pip install qai_hub

Next, go to the Qualcomm AI Hub website and log in with your Qualcomm ID. Navigate to Account -> Settings -> API Token and copy the API token. Then, run the following command to register your account:

qai-hub configure --api_token INSERT_API_TOKEN

Finally, run python profile.py and wait. A new job will appear on your AI Hub account under the Profiling tab. Once it finishes, you can review the results on the webpage and find more detail in the qwen_profile.json file.
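
For reference, a minimal profiling script in the spirit of profile.py might look like the following sketch. The device name is an assumption; list the devices available to your account with qai_hub.get_devices().

import json
import qai_hub as hub

# Device name is illustrative; pick one returned by hub.get_devices().
device = hub.Device("Snapdragon X Elite CRD")

job = hub.submit_profile_job(model="qwen.qdq.onnx", device=device)
profile = job.download_profile()  # blocks until the profiling job finishes
with open("qwen_profile.json", "w") as f:
    json.dump(profile, f, indent=2)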

Step 5: [CPU EP] Model Decoding Test

You can run run_decode.py with the quantized model that still has dynamic shapes (qwen-nofixed.qdq.onnx from Step 2) to check the actual decoded output. This is used to test the accuracy of your quantization scheme (along with perplexity).

Reference Documentation

https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#ep-provider-options
