Intel/Qwen3-235B-A22B-Thinking-2507-int4-AutoRound

Model Details

This is an int4 model with group_size 128 and symmetric quantization of Qwen/Qwen3-235B-A22B-Thinking-2507, generated by the intel/auto-round algorithm. Please follow the license of the original model.
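As a rough illustration of what "int4, group_size 128, symmetric" means for the stored weights, the sketch below quantizes a row of weights in groups of 128, with one scale per group and no zero point. It uses plain round-to-nearest, which is not the AutoRound algorithm (AutoRound tunes the rounding, as the title of the paper cited below indicates). The helper names and the choice of mapping the largest magnitude in a group to 7 are assumptions made for this example.

```python
from typing import List, Tuple

GROUP_SIZE = 128      # matches the group_size stated above
QMIN, QMAX = -8, 7    # signed 4-bit integer range


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one group (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: the scale maps the largest magnitude to QMAX; real kernels may differ.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights; symmetric means no zero point is needed."""
    return [v * scale for v in q]


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split a weight row into groups and quantize each with its own scale."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


if __name__ == "__main__":
    # Synthetic weights in [-0.5, 0.5]; two groups of 128 values each.
    row = [((i * 37) % 101 - 50) / 100.0 for i in range(256)]
    packed = quantize_row(row)
    restored = [x for q, s in packed for x in dequantize_group(q, s)]
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(len(packed), "groups, max abs error:", max_err)
```

With round-to-nearest, the reconstruction error of each weight is at most half the scale of its group, which is why smaller groups (and therefore tighter scales) generally reduce quantization error at the cost of storing more scales.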

How To Use

INT4 Inference on CPU/Intel GPU/CUDA

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/Qwen3-235B-A22B-Thinking-2507-int4-AutoRound"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

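# split the output into thinking content and final content
try:
    # position just past the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # </think> was not generated: everything is treated as final content
    index = 0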

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)
"""
thinking content:
content: Okay, the user asked for a short introduction to large language models. Let me start by understanding their request. They want something concise, so I need to keep it brief but informative.

First, I should define what an LLM is. They're AI models trained on massive text data. The key points are scale, training data, and capabilities. I should mention they're based on neural networks, specifically transformers, but maybe keep the technical terms minimal since it's an intro.

The user might be someone new to AI, so avoiding jargon is important. But they might also want to know why LLMs are significant. Highlighting their ability to generate human-like text and understand context would be essential.

I should cover the main features: size (billions of parameters), training data (vast text corpora), and tasks they can perform like answering questions, writing, translation. Also, mention that they predict the next word but can do more complex tasks through fine-tuning.

Wait, the user said "short," so I need to be concise. Maybe structure it in a few key points. Start with a simple definition, then how they work at a high level, their capabilities, and limitations. But the user didn't ask for limitations, so maybe skip that unless it's necessary for a basic intro.

Check if there are common misconceptions. Some people think LLMs understand like humans, but they're pattern-based. Should I clarify that? The intro should be accurate but not too technical. Maybe say they "learn patterns" instead of true understanding.

Also, mention real-world applications briefly. Things like chatbots, content creation, coding assistants. That makes it relatable. But keep it short.

I recall the previous response had bullet points. Maybe a paragraph followed by bullet points for clarity. The user might appreciate a structured yet concise answer.

Wait, the user might need this for a presentation, study, or just curiosity. Since they didn't specify, keeping it general is safe. Avoid deep technical details. Focus on what LLMs are, what they do, and why they're important now.

Make sure to explain terms like "parameters" simply—maybe "internal settings" or "knowledge representations." But "parameters" is standard; perhaps just say "billions of parameters" without diving deep.

Also, note that they're a type of generative AI. Emphasize the generative aspect since that's a key feature.

Double-check if the response covers: definition, how they're trained,

"""
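In the sample output above, the thinking content is empty and the reasoning text appears under content. That is what the parsing code produces when token 151668 never appears in the generated ids, for example when generation ends before the model closes its reasoning. A minimal sketch of a helper that makes this case explicit is shown below. The function name `split_thinking`, the `closed` flag, and the policy of treating unterminated output as thinking are assumptions for illustration, not part of the original snippet.

```python
from typing import List, Tuple

END_THINK_ID = 151668  # token id of </think>, as used in the snippet above


def split_thinking(
    output_ids: List[int], end_think_id: int = END_THINK_ID
) -> Tuple[List[int], List[int], bool]:
    """Return (thinking_ids, answer_ids, closed).

    closed is False when end_think_id never appears, i.e. the reasoning
    was not terminated within the generated tokens.
    """
    try:
        # Position just past the last occurrence of end_think_id.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # Hypothetical policy: treat everything as unfinished thinking.
        return output_ids, [], False
    return output_ids[:index], output_ids[index:], True
```

A caller could then decode the two id lists with `tokenizer.decode(..., skip_special_tokens=True)` as in the snippet above, and warn or raise `max_new_tokens` when `closed` is False.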

Generate the model

Here is a sample command to reproduce the model:

auto-round --model Qwen/Qwen3-235B-A22B-Thinking-2507 --output_dir "./tmp_autoround" --enable_torch_compile

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

To learn more about Intel's AI software, see Intel Neural Compressor.

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

