ramblingpolymath/Qwen3-32B-W8A8

Tags: Text Generation, text-generation-inference, 8-bit precision, compressed-tensors

Tobias Mann committed on Jul 14

Commit 1da2732 (verified) · 1 parent: 5540a7f

Update README.md

Files changed (1)

README.md +0 -56

@@ -23,56 +23,6 @@ This is a W8A8 quantized version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen
 - **Model Size**: Significantly reduced from original 32.8B parameters
 - **Precision**: INT8 for both weights and activations
-## Usage
-This quantized model maintains the same API as the original Qwen3-32B model. You can use it with the standard transformers library:
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_name = "your-username/qwen3-32b-w8a8"  # Replace with your model path
-# Load the tokenizer and quantized model
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype="auto",
-    device_map="auto"
-)
-# Prepare model input
-prompt = "Give me a short introduction to large language model."
-messages = [
-    {"role": "user", "content": prompt}
-]
-text = tokenizer.apply_chat_template(
-    messages,
-    tokenize=False,
-    add_generation_prompt=True,
-    enable_thinking=True  # Switches between thinking and non-thinking modes
-)
-model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-# Generate response
-generated_ids = model.generate(
-    **model_inputs,
-    max_new_tokens=32768
-)
-output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
-# Parse thinking content (same as original model)
-try:
-    index = len(output_ids) - output_ids[::-1].index(151668)  # </think>
-except ValueError:
-    index = 0
-thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
-content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
-print("thinking content:", thinking_content)
-print("content:", content)
-```
 ## Performance Considerations
 - **Memory Usage**: Significantly reduced memory footprint compared to the original FP16/BF16 model
@@ -132,12 +82,6 @@ Follow the same best practices as the original model:
 3. **Avoid Greedy Decoding**: Do not use greedy decoding in thinking mode
-## Deployment
-The quantized model can be deployed using the same frameworks as the original:
-- **SGLang**: `python -m sglang.launch_server --model-path your-username/qwen3-32b-w8a8 --reasoning-parser qwen3`
-- **vLLM**: `vllm serve your-username/qwen3-32b-w8a8 --enable-reasoning --reasoning-parser deepseek_r1`
 ## Original Model Information
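
Note on the removed Usage snippet: its "Parse thinking content" step splits the generated token ids at the last occurrence of token id 151668, which the snippet's own comment identifies as `</think>`. The sketch below isolates that step as a standalone function so it can be checked without loading the model. The function name `split_thinking` and the constant `END_THINK_ID` are hypothetical names introduced here for illustration; only the id value and the slicing logic come from the removed code.

```python
END_THINK_ID = 151668  # id of </think>, per the comment in the removed snippet


def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Split generated token ids at the last end-of-thinking token.

    Returns (thinking_ids, content_ids). The thinking slice includes the
    end-of-thinking token itself, matching the removed snippet's slicing.
    """
    try:
        # Search the reversed list so that the last occurrence wins.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No end-of-thinking token found: treat everything as final content.
        index = 0
    return output_ids[:index], output_ids[index:]


if __name__ == "__main__":
    thinking_ids, content_ids = split_thinking([11, 12, END_THINK_ID, 21, 22])
    print("thinking ids:", thinking_ids)
    print("content ids:", content_ids)
```

In the removed snippet, each slice was then passed to `tokenizer.decode(..., skip_special_tokens=True)`; that decoding step is left out here so the sketch runs without the tokenizer.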
