---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-235B-A22B-Instruct-2507
---

# Qwen3-235B-A22B-Instruct-2507-FP8

<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

## Highlights

We introduce **Qwen3-235B-A22B-Instruct-2507-FP8**, an updated FP8 version of the **Qwen3-235B-A22B non-thinking mode**, featuring the following key enhancements:

- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage**.
- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
- **Markedly better alignment** with user preferences in **subjective and open-ended tasks**, enabling more helpful responses and higher-quality text generation.
- **Enhanced capabilities** in **256K long-context understanding**.

## Model Overview

This repo contains the FP8 version of **Qwen3-235B-A22B-Instruct-2507**, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: **262,144 tokens natively**

**NOTE: This model supports only the non-thinking mode, i.e., it generates no `<think></think>` blocks in its outputs. Accordingly, specifying `enable_thinking=False` is no longer required.**

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).

## Performance

| | Deepseek V3-0324 | GPT-4o-0327 | Claude Opus 4 non-thinking | Kimi-K2 | Qwen3-235B-A22B non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | ---|
| **Knowledge** | | | | | | |
| MMLU | 88.3 | 89.7 | 91.7 | 89.5 | 86.0 | 89.7 |
| MMLU-Pro | 81.2 | 79.8 | 83.7 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 93.1 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 88.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 58.0 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 15.9 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 59.5 | 74.5 | 60.8 | 84.3 |
| **Reasoning** | | | | | | |
| AIME24 | 59.4 | 32.5 | 43.4 | 69.6 | 40.1 | 82.0 |
| AIME25 | 46.6 | 26.7 | 33.1 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 25.0 | 7.92 | 15.42 | - | 10.0 | 55.4 |
| ArcAGI | 9.0 | 8.8 | 28.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | 79.7 | 89.0 | 37.7 | 95.0 |
| LiveBench1125 | 66.8 | 63.7 | 74.8 | 76.4 | 62.5 | 75.4 |
| **Coding** | | | | | | |
| LCBv6 (25.02 - 25.05) | 45.2 | 35.8 | 44.6 | 48.9 | - | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 83.1 | - | 87.9 |
| Aider | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| **Instruction Following** | | | | | | |
| SIFO | 62.5 | 64.9 | 75.8 | 60.6 | 53.2 | 58.5 |
| SIFO-multiturn | 59.1 | 64.4 | 66.9 | 62.7 | 47.3 | 61.9 |
| IFEval | 82.3 | 83.9 | 88.9 | 89.8 | 83.2 | 88.7 |
| **Open-ended Tasks** | | | | | | |
| Arena-Hard v2 (win rate, GPT-4.1 as judge) | 45.6 | 61.9 | 46.6 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.1 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.7 | 86.2 | 77.0 | 85.2 |
| **Agent** | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3 (gpt-4o-20241120) | 81.4* | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8 (gpt-4o-20241120) | 59.6* | 53.5 | 32.0 | 44.0 |
| **Multilingualism** | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |

## Quickstart

The code for Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3_moe'
```
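
If you hit this error, upgrading `transformers` should resolve it; for example:

```shell
pip install --upgrade "transformers>=4.51.0"
```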

The following code snippet illustrates how to use the model to generate content from a given input.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
- SGLang:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --tp 4 --context-length 262144
```
- vLLM:
```shell
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --tensor-parallel-size 4 --max-model-len 262144
```
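
Once a server is up, you can query it with any OpenAI-compatible client. A minimal sketch, assuming the server listens on `http://localhost:8000/v1` (vLLM's default; adjust the address and port for your setup):

```python
from openai import OpenAI

# point the client at the local OpenAI-compatible endpoint started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```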

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as 32,768.

For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

## Note on FP8

For convenience and performance, we have provided an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
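
You can inspect these quantization settings without downloading the full weights; a minimal sketch using `AutoConfig`:

```python
from transformers import AutoConfig

# fetches only config.json, not the FP8 weight shards
config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507-FP8")
print(config.quantization_config)  # quantization method, block size, etc.
```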

You can use the Qwen3-235B-A22B-Instruct-2507-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just as you would the original bfloat16 model.
However, please pay attention to the following known issues:
- `transformers`:
    - There are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
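
For example, the variable can be set inline when launching a multi-GPU run (`generate.py` here is a hypothetical name for a script containing the Quickstart snippet):

```shell
CUDA_LAUNCH_BLOCKING=1 python generate.py
```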

## Agentic Use

Qwen3 excels in tool-calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.

To define the available tools, you can use an MCP configuration file, use the built-in tools of Qwen-Agent, or integrate other tools yourself.
```python
from qwen_agent.agents import Assistant

# Define the LLM served by the OpenAI-compatible endpoint started above
llm_cfg = {
    'model': 'Qwen3-235B-A22B-Instruct-2507-FP8',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            'fetch': {
                'command': 'uvx',
                'args': ['mcp-server-fetch']
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation: bot.run yields the growing response list; keep the last one
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

## Best Practices

To achieve optimal performance, we recommend the following settings:

1. **Sampling Parameters**:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0` (see the sketch after this list).
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.

2. **Adequate Output Length**: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.

3. **Standardize Output Format**: We recommend using prompts to standardize model outputs when benchmarking.
   - **Math Problems**: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - **Multiple-Choice Questions**: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
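
As a concrete illustration of the sampling settings in item 1, here is a minimal sketch that passes them to `model.generate` from the Quickstart snippet (parameter names follow the Hugging Face `GenerationConfig`; `presence_penalty` is not a `transformers` parameter and is instead exposed by OpenAI-compatible servers such as vLLM's):

```python
# assumes `model`, `tokenizer`, and `model_inputs` from the Quickstart snippet
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,  # adequate output length for most queries
    do_sample=True,        # enable sampling so the parameters below take effect
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
)
```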

### Citation

If you find our work helpful, feel free to cite us.

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```