Model Details
This model is an int4 quantization of Qwen/Qwen3-235B-A22B-Thinking-2507 with group_size 128 and symmetric quantization, generated by intel/auto-round using the auto-round-light recipe. We recommend generating the model with the full auto-round recipe instead; we will upload that version later.
Please follow the license of the original model.
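The quantization settings travel with the checkpoint, so you can verify them before downloading the full weights. A minimal sketch using transformers' AutoConfig (the exact keys depend on the exported format, but bits, group_size, and sym are the relevant entries):
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Intel/Qwen3-235B-A22B-Thinking-2507-int4-AutoRound-v0")
print(config.quantization_config)  # expect bits=4, group_size=128, sym=True among the entries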
How To Use
INT4 Inference on CPU/Intel GPU/CUDA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/Qwen3-235B-A22B-Thinking-2507-int4-AutoRound-v0"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
"""
thinking content: Okay, the user asked for a short introduction to large language models. Let me start by understanding what they need. They probably want a concise yet clear explanation without too much jargon. Since they specified "short," I should keep it to the essentials.
First, I need to define what an LLM is. Mention that it's an AI trained on vast text data. Highlight key capabilities like generating text, answering questions, and translating. But avoid diving into technical details like transformers or training processes unless necessary.
Wait, the user might not know terms like "neural networks." Should I use simpler terms? Maybe say "advanced AI systems" instead. Also, emphasize the scale: billions of parameters and massive datasets. That's crucial because size is what makes them "large."
They might be curious about how LLMs are used. Examples like chatbots, writing assistance, or code generation could help. But since it's short, pick one or two common applications. Maybe mention ChatGPT as a well-known example.
Also, note that LLMs predict the next word but can do more complex tasks. Clarify that they don't "understand" like humans but recognize patterns. Important to address limitations briefly to set realistic expectations.
Check if the response covers: definition, how they work (simplified), what they can do, and a caveat. Keep it to 3-4 sentences. Avoid markdown, use plain English. Make sure it's accessible to someone new to the topic.
Wait, the user might be a student, a professional, or just curious. Either way, clarity is key. Avoid acronyms without explanation. Use "LLM" after spelling it out first. Ensure the intro answers "what, how, why" succinctly.
Double-check length. Original response was about 100 words. That's good. Trim any fluff. Phrases like "trained on massive text datasets" are better than "enormous amounts of text." Stay precise. Mention "statistical patterns" to explain how they generate text without true comprehension.
Finally, end with a note on limitations to prevent misunderstandings. Say they can make mistakes or lack real understanding. That's important for responsible use.
</think>
content: A **Large Language Model (LLM)** is an advanced AI system trained on massive text datasets to understand, generate, and manipulate human language. Using deep learning (typically transformer architectures), it learns statistical patterns from billions of sentences, enabling it to:
- **Generate** coherent text (e
"""
Generate the model
Here is a sample command to reproduce the model:
auto-round-light --model Qwen/Qwen3-235B-A22B-Thinking-2507 --output_dir "./tmp_autoround" --enable_torch_compile
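If you prefer a script over the CLI, the command above corresponds roughly to the following Python usage (a sketch based on the intel/auto-round README; argument names and the light-recipe switch may differ across auto-round versions):
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# int4, group_size 128, symmetric quantization, matching this card's settings
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./tmp_autoround")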
Ethical Considerations and Limitations
The model can produce factually incorrect output and should not be relied on for factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, this model could generate lewd, biased, or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here is a useful link to learn more about Intel's AI software:
- Intel Neural Compressor
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
Cite
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}