Tobias Mann committed on
Commit 1da2732 · verified · 1 Parent(s): 5540a7f

Update README.md

Files changed (1)
  1. README.md +0 -56
README.md CHANGED
````diff
@@ -23,56 +23,6 @@ This is a W8A8 quantized version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen
 - **Model Size**: Significantly reduced from original 32.8B parameters
 - **Precision**: INT8 for both weights and activations
 
- ## Usage
-
- This quantized model maintains the same API as the original Qwen3-32B model. You can use it with the standard transformers library:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_name = "your-username/qwen3-32b-w8a8" # Replace with your model path
-
- # Load the tokenizer and quantized model
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     torch_dtype="auto",
-     device_map="auto"
- )
-
- # Prepare model input
- prompt = "Give me a short introduction to large language model."
- messages = [
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True,
-     enable_thinking=True # Switches between thinking and non-thinking modes
- )
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
- # Generate response
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=32768
- )
- output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
-
- # Parse thinking content (same as original model)
- try:
-     index = len(output_ids) - output_ids[::-1].index(151668) # </think>
- except ValueError:
-     index = 0
-
- thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
- content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
-
- print("thinking content:", thinking_content)
- print("content:", content)
- ```
-
 ## Performance Considerations
 
 - **Memory Usage**: Significantly reduced memory footprint compared to the original FP16/BF16 model
@@ -132,12 +82,6 @@ Follow the same best practices as the original model:
 
 3. **Avoid Greedy Decoding**: Do not use greedy decoding in thinking mode
 
- ## Deployment
-
- The quantized model can be deployed using the same frameworks as the original:
-
- - **SGLang**: `python -m sglang.launch_server --model-path your-username/qwen3-32b-w8a8 --reasoning-parser qwen3`
- - **vLLM**: `vllm serve your-username/qwen3-32b-w8a8 --enable-reasoning --reasoning-parser deepseek_r1`
 
 ## Original Model Information
````
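Both deployment commands in the removed section expose an OpenAI-compatible HTTP API. A minimal client-side sketch, assuming vLLM's default endpoint at `http://localhost:8000/v1`, a dummy API key, and the placeholder model id from the removed section:

```python
# Hypothetical client-side sketch for the removed vLLM deployment command.
# Assumes a server started with:
#   vllm serve your-username/qwen3-32b-w8a8 --enable-reasoning --reasoning-parser deepseek_r1
# and vLLM's default OpenAI-compatible endpoint; the model id is the README's placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-username/qwen3-32b-w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=512,
    temperature=0.6,
)

# Print only the final answer text; reasoning-parser output (if enabled) is returned separately.
print(response.choices[0].message.content)
```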
 