
Zhi-Create-Qwen3-32B

1. Introduction

Zhi-Create-Qwen3-32B is a fine-tuned model derived from Qwen/Qwen3-32B, with a focus on enhancing creative writing capabilities. In our evaluation with WritingBench, the model attains a score of 82.08, a significant improvement over the base Qwen3-32B model's score of 78.97.

Additionally, to maintain the model's general capabilities such as knowledge and reasoning, we performed fine-grained data mixture experiments by combining general knowledge, mathematics, code, and other data types. The final evaluation results show that general capabilities remain stable with no significant decline compared to the base model.

2. Training Process

Data

The model's training corpus comprises three primary data sources: rigorously filtered open-source datasets, synthesized chain-of-thought reasoning corpora, and curated question-answer pairs from Zhihu.

To achieve broad domain coverage, we balanced the distribution of the datasets through data mixture optimization experiments. These datasets include Dolphin-r1, Congliu/Chinese-DeepSeek-R1-Distill-data-110k, and a-m-team/AM-DeepSeek-R1-0528-Distilled, alongside high-quality content from Zhihu. All datasets underwent quality assurance through our Reward Model (RM) filtering pipeline. To preserve the model's foundational knowledge and reasoning capabilities, creative writing data was kept to approximately 23% of the training data, with the remainder consisting of mathematics, code, and general knowledge data. The chain-of-thought (CoT) reasoning components in the training data were synthesized using deepseek-ai/DeepSeek-R1-0528 and other similar models.
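For illustration, the RM filtering step described above might look like the following minimal sketch. The scoring function, the 0.75 threshold, and the sample schema are assumptions for exposition, not the actual pipeline:

def rm_filter(samples, score_fn, threshold=0.75):
    """Keep only samples whose reward-model score clears a quality threshold.

    `score_fn` stands in for the reward model; the cutoff is illustrative.
    """
    kept = []
    for sample in samples:
        score = score_fn(sample["prompt"], sample["response"])
        if score >= threshold:
            kept.append({**sample, "rm_score": score})
    return kept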

The detailed data distribution is shown in the figure below:


Figure 1: Training data distribution showing the composition of different data sources, with creative writing data accounting for approximately 23% of the total training corpus, alongside mathematics, code, and general knowledge data.

Training

Supervised Fine-tuning (SFT): We employed a curriculum learning strategy for supervised fine-tuning. This approach systematically enhances creative writing capabilities while incorporating diverse domain data to maintain core competencies and mitigate catastrophic forgetting. Adopting a multi-stage progressive iteration method, we selected samples that were insufficiently trained in previous rounds and categorized them by reasoning complexity and context length. This allowed us to gradually increase the difficulty of training samples, achieving step-by-step improvement in model performance; a sketch of this staging follows.
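The sketch below orders samples by a simple difficulty proxy that combines context length with a crude reasoning-depth signal, then splits them into stages. The proxy and the three-stage split are assumptions for exposition, not the exact criteria used in training:

def curriculum_stages(samples, tokenizer, n_stages=3):
    """Order samples by a crude difficulty proxy and split into training stages."""
    def difficulty(sample):
        # Context length plus a rough proxy for reasoning depth (illustrative only).
        n_tokens = len(tokenizer.encode(sample["prompt"] + sample["response"]))
        n_steps = sample["response"].count("\n")
        return n_tokens + 10 * n_steps

    ordered = sorted(samples, key=difficulty)
    # Later stages receive progressively harder samples.
    stages = []
    for i in range(n_stages):
        start = i * len(ordered) // n_stages
        end = (i + 1) * len(ordered) // n_stages
        stages.append(ordered[start:end])
    return stages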

Direct Preference Optimization (DPO): We integrated the RAFT (Reward-Ranked Fine-Tuning) method, combining rule-based systems and LLM-as-judge approaches to identify correct and incorrect samples. This enabled the construction of DPO preference pairs to address issues such as Chinese-English code-mixing and undesirable repetition, while simultaneously improving the model's reasoning capabilities.
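A minimal sketch of how such preference pairs could be assembled is shown below; `judge` stands in for the combined rule-based checks and LLM-as-judge scoring, and the candidate count is an assumption:

def build_dpo_pairs(prompts, generate_fn, judge, n_candidates=4):
    """Sample candidates per prompt and pair the best-scored against the worst."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(n_candidates)]
        scored = sorted(candidates, key=lambda c: judge(prompt, c), reverse=True)
        chosen, rejected = scored[0], scored[-1]
        # Only keep pairs with a clear preference signal, e.g. `rejected`
        # exhibits code-mixing or repetition that `chosen` does not.
        if judge(prompt, chosen) > judge(prompt, rejected):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs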

3. Evaluation Results

We evaluated our model using WritingBench, a comprehensive framework for assessing the writing capabilities of large language models. Zhi-Create-Qwen3-32B achieved a score of 82.08 (evaluated with Claude 3.7 Sonnet as the judge), a substantial improvement over the base Qwen3-32B model's score of 78.97.

The performance comparison across six different domains is presented in the figure below:


Figure 2: WritingBench performance comparison between Zhi-Create-Qwen3-32B and Qwen3-32B across six domains, evaluated using WritingBench with Claude 3.7 Sonnet as the judge model. The domains encompass: (D1) Academic & Engineering, (D2) Finance & Business, (D3) Politics & Law, (D4) Literature & Art, (D5) Education, and (D6) Advertising & Marketing.

4. How to Run Locally

Zhi-Create-Qwen3-32B can be deployed across various hardware configurations, including a single 80 GB GPU (H20/A800/H800). For more accessible deployment, we offer quantized versions: the FP8 quantized model (Zhi-Create-Qwen3-32B-FP8) can run on a dual RTX 4090 setup, while the Q4_K_M quantized version can be deployed on a single RTX 4090.
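These requirements line up with a back-of-the-envelope estimate of weight memory alone (activations and KV cache add overhead on top, and the ~4.5 bits per weight for Q4_K_M is an approximation):

# Rough weight-only memory estimates for a 32.8B-parameter model.
params = 32.8e9
for name, bits in [("BF16", 16), ("FP8", 8), ("Q4_K_M (~4.5 bits)", 4.5)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.0f} GiB")
# BF16:   ~61 GiB -> one 80 GB GPU
# FP8:    ~31 GiB -> two 24 GB RTX 4090s
# Q4_K_M: ~17 GiB -> one RTX 4090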

Transformers

import torch  # only needed if you pass an explicit torch_dtype below
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

MODEL_NAME = "Zhihu-ai/Zhi-Create-Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", trust_remote_code=True, torch_dtype=torch.bfloat16).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True
).eval()

# Optionally load the generation config shipped with the model; recent
# transformers versions pick it up automatically from the checkpoint.
# model.generation_config = GenerationConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)

generate_configs = {
    "temperature": 0.6,
    "do_sample": True,
    "top_p": 0.95,
    "max_new_tokens": 4096
}

# Prompt: "Please write an article introducing West Lake vinegar fish in the voice of Lu Xun."
prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    **generate_configs
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
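To print tokens to the console as they are generated instead of waiting for the full completion, you can optionally pass a TextStreamer (standard transformers API), reusing the objects defined above:

from transformers import TextStreamer

# Streams tokens to stdout as they are generated, omitting the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, streamer=streamer, **generate_configs)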

ZhiLight

You can easily start a service using ZhiLight:

docker run -it --net=host --gpus='"device=0"' -v /path/to/model:/mnt/models --entrypoint="" ghcr.io/zhihu/zhilight/zhilight:0.4.21-cu124 python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models --port 8000 --enable-reasoning --reasoning-parser deepseek-r1 --served-model-name Zhi-Create-Qwen3-32B

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'

vLLM

Alternatively, you can start a service using vLLM:

# install vllm
pip install "vllm>=0.6.4.post1"

# huggingface model id
vllm serve Zhihu-ai/Zhi-Create-Qwen3-32B --served-model-name Zhi-Create-Qwen3-32B --port 8000

# local path
vllm serve /path/to/model  --served-model-name Zhi-Create-Qwen3-32B --port 8000

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'

SGLang

You can also easily start a service using SGLang:

# install SGLang
pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

# huggingface model id 
python -m sglang.launch_server --model-path Zhihu-ai/Zhi-Create-Qwen3-32B --served-model-name Zhi-Create-Qwen3-32B --port 8000

# local path
python -m sglang.launch_server --model-path /path/to/model  --served-model-name Zhi-Create-Qwen3-32B --port 8000

# send request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'

# Alternative: query the server with the OpenAI-compatible Python client
from openai import OpenAI
openai_api_key = "empty"
openai_api_base = "http://127.0.0.1:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base
)

def get_answer(messages):
    response = client.chat.completions.create(
        messages=messages,
        model="Zhi-Create-Qwen3-32B",
        max_tokens=4096,
        temperature=0.3,
        top_p=0.95,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}}
    )
    answer = ""
    reasoning_content_all = ""
    for each in response:
        delta = each.choices[0].delta
        each_content = getattr(delta, "content", None)
        reasoning_content = getattr(delta, "reasoning_content", None)
        if each_content is not None:
            answer += each_content
            print(each_content, end="", flush=True)
        if reasoning_content is not None:
            reasoning_content_all += reasoning_content
            print(reasoning_content, end="", flush=True)
    return answer, reasoning_content_all

prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"
messages = [
    {"role": "user", "content": prompt}
]

answer, reasoning_content_all = get_answer(messages)

Ollama

You can also run the model with Ollama:

# quantization: Q4_K_M
ollama run zhihu/zhi-create-qwen3-32b

# bf16
ollama run zhihu/zhi-create-qwen3-32b:bf16

5. Usage Recommendations

For optimal performance, we recommend setting the temperature between 0.5 and 0.7 (0.6 recommended) and top_p to 0.95 for a balance of creativity and coherence.

6. Citation

@misc{Zhi-Create-Qwen3-32B,
      title={Zhi-Create-Qwen3-32B: RAFT-Enhanced Direct Preference Optimization and Curriculum Learning for Robust Creative Writing in LLMs}, 
      author={Jiewu Wang and Xu Chen and Wenyuan Su and Chao Huang and Hongkui Gao and Lin Feng and Shan Wang and Jingjing Wang and Zebin Ou},
      year={2025},
      eprint={},
      archivePrefix={},
      url={https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B}, 
}

7. Contact

If you have any questions, please raise an issue or contact us at [email protected].
