Zhi-Create-Qwen3-32B
1. Introduction
Zhi-Create-Qwen3-32B is a fine-tuned model derived from Qwen/Qwen3-32B, with a focus on enhancing creative writing capabilities. In our evaluation on WritingBench, the model attains a score of 82.08, a significant improvement over the base Qwen3-32B model's score of 78.97.
Additionally, to maintain the model's general capabilities such as knowledge and reasoning, we performed fine-grained data mixture experiments by combining general knowledge, mathematics, code, and other data types. The final evaluation results show that general capabilities remain stable with no significant decline compared to the base model.
2. Training Process
Data
The model's training corpus comprises three primary data sources: rigorously filtered open-source datasets, synthesized chain-of-thought reasoning corpora, and curated question-answer pairs from Zhihu.
To achieve optimal domain coverage, we carefully balanced the distribution of the various datasets through data mixture optimization experiments. These datasets include Dolphin-r1, Congliu/Chinese-DeepSeek-R1-Distill-data-110k, and a-m-team/AM-DeepSeek-R1-0528-Distilled, alongside high-quality content from Zhihu. All datasets underwent quality filtering through our Reward Model (RM) pipeline. To preserve the model's foundational knowledge and reasoning capabilities, creative writing data was limited to approximately 23% of the training data, with the remainder consisting of mathematics, code, and general knowledge data. The chain-of-thought (CoT) reasoning components in the training data were synthesized using deepseek-ai/DeepSeek-R1-0528 and other similar models.
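As a rough illustration of the RM filtering and mixture step described above, a minimal sketch might look like the following. The pool names, the 0.7 reward threshold, and score_with_rm are illustrative assumptions; only the ~23% creative-writing share comes from the description above.

import random

# Hypothetical sketch of the data mixture step: keep RM-approved samples,
# then subsample each pool to its target share of the final corpus.
TARGET_MIX = {"creative_writing": 0.23, "math_code_general": 0.77}  # the 77% is not broken down further here
RM_THRESHOLD = 0.7  # assumed reward-model acceptance score

def score_with_rm(sample):
    """Placeholder for the reward-model scorer used for quality filtering."""
    return sample.get("rm_score", 0.0)

def build_corpus(pools, total_size):
    corpus = []
    for domain, ratio in TARGET_MIX.items():
        # Keep only samples that pass the RM filter, then subsample to the target ratio.
        kept = [s for s in pools[domain] if score_with_rm(s) >= RM_THRESHOLD]
        corpus.extend(random.sample(kept, min(len(kept), int(total_size * ratio))))
    random.shuffle(corpus)
    return corpus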
The detailed data distribution is shown in the figure below:
Training
Supervised Fine-tuning (SFT): We employed a curriculum learning strategy for supervised fine-tuning. This approach progressively strengthens creative writing capabilities while incorporating diverse domain data to maintain core competencies and mitigate catastrophic forgetting. Using a multi-stage, progressive iteration scheme, we select samples that were insufficiently trained in previous rounds and categorize samples by reasoning complexity and context length, gradually increasing the difficulty of the training samples to improve model performance step by step.
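A minimal sketch of such a staged curriculum is shown below. The difficulty heuristic (reasoning-chain length plus context length) and the loss threshold used to flag insufficiently trained samples are illustrative assumptions, not the actual recipe.

# Illustrative curriculum loop: sort by an assumed difficulty score, train in
# stages of increasing difficulty, and carry under-trained samples forward.
def difficulty(sample):
    # Longer reasoning chains and longer contexts are treated as harder.
    return len(sample["cot_tokens"]) + len(sample["context_tokens"])

def run_curriculum(samples, train_one_stage, n_stages=3, loss_threshold=1.0):
    ordered = sorted(samples, key=difficulty)
    stage_size = len(ordered) // n_stages
    carry_over = []  # samples the previous stage did not learn well enough
    for stage in range(n_stages):
        batch = ordered[stage * stage_size:(stage + 1) * stage_size] + carry_over
        per_sample_loss = train_one_stage(batch)  # assumed to return one loss per sample
        carry_over = [s for s, l in zip(batch, per_sample_loss) if l > loss_threshold]
    return carry_over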
Direct Preference Optimization (DPO): We integrate the RAFT (Reward-Ranked Fine-Tuning) method, combining rule-based systems and LLM-as-judge approaches to identify correct and incorrect samples. This enables the construction of DPO preference sample pairs to address issues such as Chinese-English code-mixing and undesirable repetition in the model, while simultaneously improving its reasoning capabilities.
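The pair-construction step can be pictured with the sketch below. The specific rules (a crude code-mixing regex and an n-gram repetition check) and llm_judge_score are illustrative assumptions standing in for the actual rule system and judge.

import re

def has_code_mixing(text):
    # Crude rule: a run of Latin letters sandwiched between Chinese characters.
    return bool(re.search(r"[\u4e00-\u9fff][A-Za-z]{2,}[\u4e00-\u9fff]", text))

def has_repetition(text, n=20):
    # Crude rule: any 20-character chunk that occurs more than once.
    chunks = [text[i:i + n] for i in range(0, max(len(text) - n, 0), n)]
    return len(chunks) != len(set(chunks))

def build_dpo_pairs(prompt, candidates, llm_judge_score):
    """Pair the judge's best clean response against a flagged or lowest-ranked one."""
    flagged = [c for c in candidates if has_code_mixing(c) or has_repetition(c)]
    clean = [c for c in candidates if c not in flagged]
    if not clean or (not flagged and len(clean) < 2):
        return []
    chosen = max(clean, key=llm_judge_score)
    rejected = flagged[0] if flagged else min(clean, key=llm_judge_score)
    return [{"prompt": prompt, "chosen": chosen, "rejected": rejected}]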
3. Evaluation Results
We evaluated our model using WritingBench, a comprehensive framework for assessing the writing capabilities of large language models. Zhi-Create-Qwen3-32B achieved a score of 82.08 (evaluated with Claude 3.7 Sonnet as the judge), a substantial improvement over the base Qwen3-32B model's score of 78.97.
The performance comparison across six different domains is presented in the figure below:
4. How to Run Locally
Zhi-Create-Qwen3-32B can be deployed across various hardware configurations, including a single 80 GB GPU (H20/A800/H800). For more accessible deployment, we offer quantized versions: the FP8 quantized model (Zhi-Create-Qwen3-32B-FP8) can run on a dual RTX 4090 setup, while the Q4_K_M quantized version can be deployed on a single RTX 4090.
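As a rough sanity check on these requirements, the weight memory alone works out approximately as follows (an estimate only, assuming ~4.5 effective bits per weight for Q4_K_M; KV cache and runtime overhead are extra):

# Back-of-the-envelope weight memory for a 32B-parameter model (weights only).
params = 32e9
print(f"bf16   : {params * 2 / 1e9:.0f} GB")        # ~64 GB -> one 80 GB GPU
print(f"fp8    : {params * 1 / 1e9:.0f} GB")        # ~32 GB -> two 24 GB RTX 4090s
print(f"q4_k_m : {params * 4.5 / 8 / 1e9:.0f} GB")  # ~18 GB -> one 24 GB RTX 4090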
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
MODEL_NAME = "Zhihu-ai/Zhi-Create-Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# load in bf16 (requires `import torch`)
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", trust_remote_code=True, torch_dtype=torch.bfloat16).eval()
# load in fp16 (requires `import torch`)
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="cpu", trust_remote_code=True).eval()
# default below: torch_dtype="auto" picks the checkpoint precision, device_map="auto" handles placement
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
).eval()
# Specify hyperparameters for generation (with transformers>=4.32.0 the generation config is loaded automatically).
# model.generation_config = GenerationConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)
generate_configs = {
    "temperature": 0.6,
    "do_sample": True,
    "top_p": 0.95,
    "max_new_tokens": 4096
}
prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"  # "Write an article introducing West Lake vinegar fish in the voice of Lu Xun"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    **generate_configs
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
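Because the model inherits Qwen3's thinking format, the decoded response typically wraps its reasoning in a <think>...</think> block before the final answer. A simple way to separate the two, assuming the tags appear verbatim in the decoded text:

# Split off the reasoning block (assumes the <think>...</think> tags are present in `response`).
if "</think>" in response:
    reasoning, answer = response.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    reasoning, answer = "", response.strip()
print(answer)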
ZhiLight
You can easily start a service using ZhiLight:
docker run -it --net=host --gpus='"device=0"' -v /path/to/model:/mnt/models --entrypoint="" ghcr.io/zhihu/zhilight/zhilight:0.4.21-cu124 python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models --port 8000 --enable-reasoning --reasoning-parser deepseek-r1 --served-model-name Zhi-Create-Qwen3-32B
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
vLLM
You can also start a service using vLLM:
# install vllm
pip install "vllm>=0.6.4.post1"
# huggingface model id
vllm serve Zhihu-ai/Zhi-Create-Qwen3-32B --served-model-name Zhi-Create-Qwen3-32B --port 8000
# local path
vllm serve /path/to/model --served-model-name Zhi-Create-Qwen3-32B --port 8000
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
SGLang
You can also easily start a service using SGLang
# install SGLang
pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
# huggingface model id
python -m sglang.launch_server --model-path Zhihu-ai/Zhi-Create-Qwen3-32B --served-model-name Zhi-Create-Qwen3-32B --port 8000
# local path
python -m sglang.launch_server --model-path /path/to/model --served-model-name Zhi-Create-Qwen3-32B --port 8000
# send request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
# Alternative: using the OpenAI-compatible API from Python
from openai import OpenAI
openai_api_key = "empty"
openai_api_base = "http://127.0.0.1:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base
)
def get_answer(messages):
    response = client.chat.completions.create(
        messages=messages,
        model="Zhi-Create-Qwen3-32B",
        max_tokens=4096,
        temperature=0.3,
        top_p=0.95,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}}
    )
    answer = ""
    reasoning_content_all = ""
    for chunk in response:
        delta = chunk.choices[0].delta
        # Streamed chunks carry either answer tokens (content) or thinking tokens (reasoning_content).
        each_content = getattr(delta, "content", None)
        reasoning_content = getattr(delta, "reasoning_content", None)
        if reasoning_content is not None:
            reasoning_content_all += reasoning_content
            print(reasoning_content, end="", flush=True)
        if each_content is not None:
            answer += each_content
            print(each_content, end="", flush=True)
    return answer, reasoning_content_all
prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"
messages = [
    {"role": "user", "content": prompt}
]
answer, reasoning_content_all = get_answer(messages)
ollama
You can also run the model with ollama once it is installed:
- quantization: Q4_K_M
ollama run zhihu/zhi-create-qwen3-32b
- bf16
ollama run zhihu/zhi-create-qwen3-32b:bf16
5. Usage Recommendations
For optimal performance, we recommend setting the temperature between 0.5 and 0.7 (0.6 recommended) and top_p to 0.95 for a good balance between creativity and coherence.
6. Citation
@misc{Zhi-Create-Qwen3-32B,
    title={Zhi-Create-Qwen3-32B: RAFT-Enhanced Direct Preference Optimization and Curriculum Learning for Robust Creative Writing in LLMs},
    author={Jiewu Wang and Xu Chen and Wenyuan Su and Chao Huang and Hongkui Gao and Lin Feng and Shan Wang and Jingjing Wang and Zebin Ou},
    year={2025},
    url={https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B},
}
7. Contact
If you have any questions, please raise an issue or contact us at [email protected].