馃 Qwen2.5-32B-Coder-NF4-Quantized

This is a 4-bit NF4-quantized version of [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated).

The model was quantized to enable efficient inference on hardware with limited VRAM while largely preserving the quality of the original model.


## ⚙️ Model Specifications and Quantization

This model was quantized with the `bitsandbytes` library using the NF4 (Normal Float 4-bit) format; `bitsandbytes` is also required to load it.

**Model Configuration** (from `config.json`):

| Parameter | Value | Description |
|---|---|---|
| Architecture | `Qwen2ForCausalLM` | The model's base architecture. |
| Parameter Count | 32 billion (original) | The number of parameters in the original model. |
| Number of Layers | 64 | The number of transformer blocks. |
| Hidden Size | 5120 | The dimension of the hidden states. |
| Context Length | 32768 | The maximum context length the model can process. |
| Dtype (Activations) | `bfloat16` | The data type for activations during inference (recommended for stability). |

**Quantization Details** (`quantization_config`):

| Parameter | Value | Description |
|---|---|---|
| Method | `bitsandbytes` | The quantization library used. |
| Load In 4-bit | `true` | The model is loaded with 4-bit weights. |
| Quantization Type | `nf4` | Normal Float 4-bit, optimized for normally distributed transformer weights. |
| Compute Dtype | `bfloat16` | The dtype weights are dequantized to for computation (matrix multiplication). |
| Double Quantization | `true` | Applies a second 8-bit quantization to the scaling constants, further reducing memory usage. |
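
For reference, these settings correspond to the `BitsAndBytesConfig` sketched below. The configuration is already stored in this repository's `config.json`, so passing it explicitly is optional; the snippet only illustrates how the same quantization would be requested by hand when loading an unquantized checkpoint.

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute and double quantization,
# matching the quantization_config table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# Pass as quantization_config=bnb_config to AutoModelForCausalLM.from_pretrained(...)
```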

## 💻 Usage (Inference)

To use this quantized model, make sure `accelerate` and `bitsandbytes` are installed. You can then load the model directly with `AutoModelForCausalLM` from the Hugging Face `transformers` library; the quantization settings are picked up automatically from the repository's `config.json`.

### Required Libraries

```bash
pip install transformers accelerate bitsandbytes torch
```
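
As an optional sanity check before downloading a 32B model, you can confirm that a CUDA device is visible and that `bitsandbytes` imports cleanly. This is a minimal sketch, not required for loading:

```python
import torch
import bitsandbytes as bnb  # fails here if the 4-bit backend is not installed

print("bitsandbytes version:", bnb.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Total VRAM of the first GPU, in GiB
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total VRAM: {total_gib:.1f} GiB")
```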
### Example: Code Generation

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in 4-bit; the quantization_config saved in the repo is applied automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# 📝 Input prompt
prompt = "def quicksort(arr):"
messages = [
    {"role": "user", "content": f"Write a Python function for quicksort.\n\n{prompt}"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,  # ensures correct padding/EOS handling
)

# Keep only the newly generated tokens (drop the echoed prompt) and decode
new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(generated_text)
```
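
For interactive use, the same call can stream tokens to the console as they are generated. A minimal sketch using `transformers.TextStreamer`, reusing `model`, `tokenizer`, and `model_inputs` from the example above:

```python
from transformers import TextStreamer

# Prints tokens as they are produced, skipping the echoed prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
)
```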

## Disclaimer and Limitations

**Abliterated Model Status:** This model is based on the *abliterated* variant (`*-abliterated`), meaning that certain behaviors of the original model (typically its refusal responses) were deliberately removed or modified. The quantized version inherits these characteristics, and performance in some domains may differ from the non-abliterated base model.

**Memory Requirements:** Although the weights are 4-bit quantized, this is still a 32B-parameter model and requires a GPU with substantial VRAM (typically ~18 GB or more, depending on context length).
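
A quick way to check the real footprint on your hardware is sketched below, assuming the model has already been loaded as in the usage example:

```python
import torch

# Approximate memory taken by the quantized weights, in GiB
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")

# Peak GPU memory actually allocated so far (weights plus activations)
if torch.cuda.is_available():
    print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")
```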

**Accuracy:** Quantization to 4-bit (NF4) introduces a small loss of precision, which may reduce output quality slightly compared to the original FP16/BF16 model.

...

## 🔗 Sources and Acknowledgements

- Base model: [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated)
- Original model family: Qwen2.5-Coder by the Qwen team
- Quantization performed with `bitsandbytes` via the Hugging Face `transformers` library