---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-generation
library_name: transformers
tags:
- Qwen
- Qwen3
- Int8
---
# Qwen3-0.6B-Int8
This version of Qwen3-0.6B has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 4.2 (not yet released).
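Here, **w8a16** means the weights are stored as 8-bit integers while activations stay in 16-bit floating point, roughly halving weight memory traffic at a small accuracy cost. A minimal sketch of the idea, assuming per-channel symmetric quantization (names and shapes below are illustrative, not the Pulsar2 implementation):
```python
import numpy as np

def quantize_w8(w: np.ndarray):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale.astype(np.float16)

def linear_w8a16(x: np.ndarray, w_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """y = x @ W.T with int8 weights dequantized on the fly; activations stay fp16."""
    w_deq = w_q.astype(np.float16) * scale  # dequantize: int8 * per-row scale
    return x @ w_deq.T

w = np.random.randn(64, 128).astype(np.float32)   # fp32 master weights
x = np.random.randn(1, 128).astype(np.float16)    # fp16 activations
w_q, s = quantize_w8(w)
y = linear_w8a16(x, w_q, s)                       # fp16 output, shape (1, 64)
```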
## Conversion tool links
If you are interested in model conversion, you can export the axmodel yourself starting from the original model repo:
https://huggingface.co/Qwen/Qwen3-0.6B
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm)
## Supported platforms
- AX650
  - [M4N-Dock (η±θ―ζ΄ΎPro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - *in development*
|Chip|w8a16|w4a16|
|--|--|--|
|AX650|20 tokens/sec|TBD|
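The w8a16 figure is the decode throughput the runtime itself reports when it hits EOS (the `avg ... token/s` lines in the logs below): generated tokens divided by decode wall time. A trivial sketch of that calculation, with `step_fn` as a hypothetical stand-in for one decode step:
```python
import time

def decode_throughput(step_fn, num_tokens: int = 128) -> float:
    """Average decode speed in tokens/sec: tokens generated / wall time."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        step_fn()  # one autoregressive decode step (hypothetical stand-in)
    return num_tokens / (time.perf_counter() - start)
```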
## How to use
Download all files from this repository to the device. A scripted download option is sketched after the listing below.
```
root@ax650:/mnt/qtang/llm-test/qwen3-0.6b# tree -L 1
.
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- qwen2.5_tokenizer
|-- qwen3-0.6b-ax650
|-- qwen3_tokenizer
|-- qwen3_tokenizer_uid.py
|-- run_qwen3_0.6b_int8_ctx_ax650.sh
|-- run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh
`-- run_qwen3_0.6b_int8_ctx_axcl_x86.sh
```
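Alternatively, the files can be fetched programmatically with `huggingface_hub`; the `repo_id` below is an assumption based on this card's name, so adjust it to the actual repository:
```python
from huggingface_hub import snapshot_download

# repo_id is assumed from this model card's title; replace if it differs
snapshot_download(
    repo_id="AXERA-TECH/Qwen3-0.6B-Int8",
    local_dir="qwen3-0.6b",
)
```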
#### Start the Tokenizer service
Install the required Python packages:
```
pip install transformers jinja2
```
```
root@ax650:/mnt/qtang/llm-test/qwen3-0.6b# python3 qwen3_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
```
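The on-device binary offloads tokenization and chat templating to this HTTP service so it does not need an embedded Python runtime. A minimal sketch of what such a service could look like, assuming simple JSON `/encode` and `/decode` endpoints (illustrative only; the actual protocol of `qwen3_tokenizer_uid.py` may differ):
```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from transformers import AutoTokenizer

# Tokenizer files shipped in this repo (the qwen3_tokenizer folder above)
tokenizer = AutoTokenizer.from_pretrained("qwen3_tokenizer")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/encode":
            out = {"token_ids": tokenizer.encode(body["text"])}
        else:  # assumed /decode endpoint
            out = {"text": tokenizer.decode(body["token_ids"])}
        data = json.dumps(out).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

HTTPServer(("0.0.0.0", 12345), Handler).serve_forever()
```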
#### Inference on an AX650 host, such as the M4N-Dock (η±θ―ζ΄ΎPro) or the AX650N DEMO board
Open another terminal and run `run_qwen3_0.6b_int8_ctx_ax650.sh`
```
root@ax650:/mnt/qtang/llm-test/qwen3-0.6b# ./run_qwen3_0.6b_int8_ctx_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: 8199112b-da8a-4f39-ae48-9d83f422b2d3
bos_id: -1, eos_id: 151645
3% | ββ | 1 / 31 [3.76s<116.56s, 0.27 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ββββββββββββββββββββββββββββββββ | 31 / 31 [6.18s<6.18s, 5.01 count/s] init post axmodel ok,remain_cmm(10021 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 193]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 209]: prefill_max_token_num : 2048
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 270]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 307]: input_num_token:21
[I][ main][ 230]: precompute_len: 21
[I][ main][ 231]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> who are you
[I][ SetKVCache][ 530]: prefill_grpid:2 kv_cache_num:512 precompute_len:57 input_num_token:14
[I][ SetKVCache][ 533]: current prefill_max_token_num:1920
[I][ Run][ 659]: input token num : 14, prefill_split_num : 1
[I][ Run][ 685]: input_num_token:14
[I][ Run][ 808]: ttft: 586.92 ms
<think>
</think>
I'm Qwen, a large language model developed by Alibaba Cloud. I can help with a wide range of tasks,
from answering questions to writing code, providing information, and even assisting with creative projects.
Let me know what you need!
[N][ Run][ 922]: hit eos,avg 19.01 token/s
[I][ GetKVCache][ 499]: precompute_len:123, remaining:1925
prompt >> q
root@ax650:/mnt/qtang/llm-test/qwen3-0.6b#
```
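Note the sampling settings the runtime loads from `post_config.json`: `enable_top_k_sampling` is on with `top_k: 1`, and temperature and top-p are disabled, so decoding is effectively greedy, which is why repeated runs produce near-identical answers. A small sketch of why top-k with k=1 reduces to argmax:
```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, rng=np.random.default_rng()):
    """Sample from the k highest-logit tokens (softmax-renormalized)."""
    top = np.argsort(logits)[-k:]                  # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

logits = np.array([1.0, 3.2, 0.5, 2.9])
assert top_k_sample(logits, k=1) == int(np.argmax(logits))  # k=1 is greedy
```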
#### Inference with M.2 Accelerator card
[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.
```
(base) axera@raspberrypi:~/samples/qwen3-0.6b $ ./run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh
[I][ Init][ 136]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: afec8311-55c9-4785-9fed-949368362b0e
bos_id: -1, eos_id: 151645
3% | ββ | 1 / 31 [1.00s<31.12s, 1.00 count/s] tokenizer init ok
[I][ Init][ 45]: LLaMaEmbedSelector use mmap
6% | βββ | 2 / 31 [1.00s<15.56s, 1.99 count/s] embed_selector init ok
[I][ run][ 30]: AXCLWorker start with devid 0
100% | ββββββββββββββββββββββββββββββββ | 31 / 31 [28.32s<28.32s, 1.09 count/s] init post axmodel ok,remain_cmm(5068 MB)
[I][ Init][ 237]: max_token_len : 2559
[I][ Init][ 240]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 248]: prefill_token_num : 128
[I][ Init][ 252]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 252]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 252]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 252]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 252]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 256]: prefill_max_token_num : 2048
________________________
| ID| remain cmm(MB)|
========================
| 0| 5068|
Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―Β―
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 335]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 372]: input_num_token:21
[I][ main][ 236]: precompute_len: 21
[I][ main][ 237]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> who are you?
[I][ SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:16
[I][ SetKVCache][ 631]: current prefill_max_token_num:1920
[I][ Run][ 869]: input token num : 16, prefill_split_num : 1
[I][ Run][ 901]: input_num_token:16
[I][ Run][1030]: ttft: 670.05 ms
<think>
</think>
I am Qwen, a large language model developed by Alibaba Cloud.
I am designed to assist with a wide range of tasks and provide helpful information.
If you have any questions or need assistance, feel free to ask!
[N][ Run][1182]: hit eos,avg 13.06 token/s
[I][ GetKVCache][ 597]: precompute_len:85, remaining:1963
prompt >> what can you do?
[I][ SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:85 input_num_token:17
[I][ SetKVCache][ 631]: current prefill_max_token_num:1920
[I][ Run][ 869]: input token num : 17, prefill_split_num : 1
[I][ Run][ 901]: input_num_token:17
[I][ Run][1030]: ttft: 671.29 ms
<think>
</think>
I can help with a variety of tasks and provide assistance in different areas. For example, I can:
- Answer questions about technology, science, culture, and more.
- Help with writing, research, and problem-solving.
- Provide information and support in different languages.
- Assist with tasks such as writing, coding, and data analysis.
Let me know what you need!
[N][ Run][1182]: hit eos,avg 13.05 token/s
[I][ GetKVCache][ 597]: precompute_len:181, remaining:1867
prompt >> q
(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V3.4.0_20250423020139 Driver V3.4.0_20250423020139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
| 0 AX650N V3.4.0 | 0000:01:00.0 | 182 MiB / 945 MiB |
| -- 35C -- / -- | 1% 0% | 971 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+
+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 53261 /home/axera/samples/qwen3-0.6b/main_axcl_aarch64 953772 KiB |
+------------------------------------------------------------------------------------------------+
```