---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-235B-A22B-Instruct-2507
---

# Qwen3-235B-A22B-Instruct-2507-FP8
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

## Highlights

We introduce the FP8 version of the updated **Qwen3-235B-A22B non-thinking mode**, named **Qwen3-235B-A22B-Instruct-2507-FP8**, featuring the following key enhancements:

- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage**.
- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
- **Markedly better alignment** with user preferences in **subjective and open-ended tasks**, enabling more helpful responses and higher-quality text generation.
- **Enhanced capabilities** in **256K long-context understanding**.

## Model Overview

This repo contains the FP8 version of **Qwen3-235B-A22B-Instruct-2507**, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: **262,144 natively**

**NOTE: this model supports only the non-thinking mode. In other words, there are no `<think></think>` blocks in its generated outputs. Meanwhile, specifying `enable_thinking=False` is no longer required.**

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).


## Performance

|  | Deepseek V3-0324 | GPT-4o-0327 | Claude Opus 4 non-thinking | Kimi-K2 | Qwen3-235B-A22B non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | ---|
| **Knowledge** | | | | | | |
| MMLU | 88.3 | 89.7 | 91.7 | 89.5 | 86.0 | 89.7 |
| MMLU-Pro | 81.2 | 79.8 | 83.7 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 93.1 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 88.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 58.0 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 15.9 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 59.5 | 74.5 | 60.8 | 84.3 |
| **Reasoning** | | | | | | |
| AIME24 | 59.4 | 32.5 | 43.4 | 69.6 | 40.1 | 82.0 |
| AIME25 | 46.6 | 26.7 | 33.1 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 25.0 | 7.92 | 15.42 | ? | 10.0 | 55.4 |
| ArcAGI | 9.0 | 8.8 | 28.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | 79.7 | 89.0 | 37.7 | 95.0 |
| LiveBench1125 | 66.8 | 63.7 | 74.8 | 76.4 | 62.5 | 75.4 |
| **Coding** | | | | | | |
| LCBv6 (25.02 - 25.05) | 45.2 | 35.8 | 44.6 | 48.9 | ing | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 83.1 | ing | 87.9 |
| Aider | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| **Instruction Following** | | | | | | |
| SIFO | 62.5 | 64.9 | 75.8 | 60.6 | 53.2 | 58.5 |
| SIFO-multiturn | 59.1 | 64.4 | 66.9 | 62.7 | 47.3 | 61.9 |
| IFEval | 82.3 | 83.9 | 88.9 | 89.8 | 83.2 | 88.7 |
| **Open-ended tasks** | | | | | | |
| Arena-Hard v2 (win rate, GPT-4.1 as judge) | 45.6 | 61.9 | 46.6 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.1 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.7 | 86.2 | 77.0 | 85.2 |
| **Agent** | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3 (gpt-4o-20241120) | 81.4* | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8 (gpt-4o-20241120) | 59.6* | 53.5 | 32.0 | 44.0 |
| **Multilingualism** | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |


## Quickstart

The code for Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3_moe'
```
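
If you hit this error, upgrading resolves it. As an optional sanity check, a minimal sketch (the version bound comes from the note above; `packaging` ships as a `transformers` dependency):
```python
import transformers
from packaging import version

# Qwen3-MoE support ('qwen3_moe') requires transformers >= 4.51.0
if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError("Please upgrade: pip install -U 'transformers>=4.51.0'")
```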

The following code snippet illustrates how to use the model to generate content from a given input:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```
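
If you prefer to stream tokens to stdout as they are generated, `transformers` ships a `TextStreamer` helper; a minimal sketch reusing the `model`, `tokenizer`, and `model_inputs` objects from the snippet above:
```python
from transformers import TextStreamer

# prints decoded tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=16384, streamer=streamer)
```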

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
- SGLang:
  ```shell
  python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --tp 4 --context-length 262144
  ```
- vLLM:
  ```shell
  vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --tensor-parallel-size 4 --max-model-len 262144
  ```

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as 32,768.
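
Once either server is running, you can query the endpoint with any OpenAI-compatible client. A minimal sketch, assuming the vLLM command above and the official `openai` Python package (the port and served model name follow the vLLM defaults):
```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on port 8000 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,  # recommended sampling settings; see Best Practices below
    top_p=0.8,
    max_tokens=16384,
)
print(response.choices[0].message.content)
```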

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

## Note on FP8

For convenience and performance, we provide an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
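
For reference, a minimal sketch of inspecting that field without downloading the full weights (`AutoConfig` fetches only `config.json`; the exact keys depend on the checkpoint):
```python
from transformers import AutoConfig

# reads config.json only, not the model weights
config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507-FP8")
print(config.quantization_config)  # quantization method and block size
```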

You can use the Qwen3-235B-A22B-Instruct-2507-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just like the original bfloat16 model.
However, please pay attention to the following known issue:
- `transformers`:
  - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference, as shown in the sketch below.
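
A minimal sketch of that workaround (setting the variable before the model is loaded onto the GPUs is the safe order):
```python
import os

# workaround for the fine-grained fp8 distributed-inference issue noted above
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    torch_dtype="auto",
    device_map="auto",  # multi-GPU inference
)
```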

## Agentic Use

Qwen3 excels in tool-calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.

To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools by yourself.
```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-235B-A22B-Instruct-2507-FP8',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

## Best Practices

To achieve optimal performance, we recommend the following settings:

1. **Sampling Parameters**:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0` (see the sketch after this list).
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.

2. **Adequate Output Length**: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.

3. **Standardize Output Format**: We recommend using prompts to standardize model outputs when benchmarking.
   - **Math Problems**: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - **Multiple-Choice Questions**: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
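
As a concrete illustration of setting 1, a minimal sketch applying the recommended sampling parameters to the `transformers` Quickstart snippet (parameter names follow the Hugging Face `generate` API):
```python
# reuses `model` and `model_inputs` from the Quickstart snippet
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,  # adequate output length for most queries
    do_sample=True,        # sampling must be enabled for the settings below
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
)
```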

### Citation

If you find our work helpful, feel free to cite it.

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```