# Qwen2.5-32B-Coder-NF4-Quantized
This is a 4-bit NF4-quantized version of huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated.
The model was quantized to enable efficient inference on hardware with limited VRAM while largely preserving the original model's performance.
## ⚙️ Model Specifications and Quantization
The model was quantized with the bitsandbytes library using the NF4 (Normal Float 4-bit) format; bitsandbytes is also required to load it.
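For reference, the sketch below shows how this kind of NF4 quantization is typically configured with bitsandbytes when loading the base model on the fly. The parameter values mirror the quantization_config documented below; the exact settings used to produce this checkpoint are the ones stored in its config.json.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 settings matching the quantization_config below (illustrative sketch)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matrix multiplications
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scaling constants
)

# Quantize the original BF16 model while loading it (base model from the Sources section)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```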
Model Configuration (from config.json):
| Parameter | Value | Description |
|---|---|---|
| Architecture | Qwen2ForCausalLM | The model's base architecture. |
| Parameter Count | 32 billion (original) | The original number of parameters. |
| Number of Layers | 64 | The number of transformer blocks. |
| Hidden Size | 5120 | The dimension of the hidden states. |
| Context Length | 32768 | The maximum context length the model can process. |
| Dtype (Activations) | bfloat16 | The data type for activations during inference (recommended for stability). |
Quantization Details (quantization_config):
| Parameter | Value | Description |
|---|---|---|
| Method | bitsandbytes | The quantization library used. |
| Load In 4-bit | true | The model is loaded with 4-bit weights. |
| Quantization Type | nf4 | Normal Float 4-bit, optimized for transformer weights. |
| Compute Dtype | bfloat16 | The dtype weights are dequantized to for computation (matrix multiplication). |
| Double Quantization | true | Applies a second 8-bit quantization to the scaling constants, further reducing memory usage. |
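Because these settings are serialized into config.json, they can be inspected without downloading the full weights. A small sketch, assuming the repository id used in the usage example below:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4")

# Architecture fields from config.json
print(config.architectures)            # ['Qwen2ForCausalLM']
print(config.num_hidden_layers)        # 64
print(config.hidden_size)              # 5120
print(config.max_position_embeddings)  # 32768

# The embedded bitsandbytes settings (quant type, compute dtype, double quantization, ...)
print(config.quantization_config)
```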
## 💻 Usage (Inference)
To use this quantized model, make sure accelerate and bitsandbytes are installed. The model can be loaded directly with AutoModelForCausalLM from the Hugging Face transformers library; the saved quantization_config is applied automatically.
### Required Libraries
```bash
pip install transformers accelerate bitsandbytes torch
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in 4-bit; the quantization_config saved in config.json is applied automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# 📝 Input prompt
prompt = "def quicksort(arr):"
messages = [
    {"role": "user", "content": f"Write a Python function for quicksort.\n\n{prompt}"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,  # Ensures correct padding/EOS
)

# Decode only the newly generated tokens (skip the echoed prompt) and print the result
new_tokens = generated_ids[:, model_inputs.input_ids.shape[1]:]
generated_text = tokenizer.decode(new_tokens[0], skip_special_tokens=True)
print(generated_text)
```
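For interactive use, generation can also be streamed token by token with transformers' TextStreamer. A brief sketch reusing the model, tokenizer, and inputs prepared above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
)
```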
## Disclaimer and Limitations
- **Abliterated Model Status:** This model is based on the "abliterated" variant (`*-abliterated`), meaning certain capabilities or behaviors were deliberately modified or removed from the base model. The quantized version inherits these characteristics, and performance in some domains may differ from the non-abliterated base model.
- **Memory Requirements:** Although the weights are 4-bit quantized, this is a 32B-parameter model and still requires a GPU with significant VRAM (typically ~18 GB or more, depending on context length); see the rough estimate after this list.
- **Accuracy:** Quantization to 4-bit (NF4) introduces a small loss of precision, which may affect output quality compared to the original FP16/BF16 model.
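As a rough back-of-the-envelope check of the VRAM figure above, the sketch below estimates the weight footprint only; it assumes the bitsandbytes default block size of 64 and a second-level block size of 256 for double quantization, and ignores activations, the KV cache, and framework overhead.

```python
# Rough VRAM estimate for 4-bit NF4 weights with double quantization (illustrative only)
params = 32e9      # ~32B parameters
weight_bits = 4    # NF4 stores each weight in 4 bits
block_size = 64    # assumed bitsandbytes default block size

# Per-parameter overhead of the scaling constants:
#   without double quant: one fp32 absmax per block      -> 32 / 64 = 0.5 bits/param
#   with double quant:    8-bit absmax per block plus a
#                         second-level fp32 per 256 blocks -> ~0.127 bits/param
overhead_bits = 8 / block_size + 32 / (block_size * 256)

weight_gib = params * (weight_bits + overhead_bits) / 8 / 1024**3
print(f"~{weight_gib:.1f} GiB for weights alone")  # ≈ 15.4 GiB
```

The KV cache and activations grow with context length, which is why ~18 GB or more of VRAM is recommended in practice.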
...
## 🔗 Sources and Acknowledgements
- Original Model: huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated
- Quantization Technology: Bitsandbytes Library ...