馃 Qwen2.5-32B-Coder-NF4-Quantized

This is a 4-bit NF4-quantized version of [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated).

The model was quantized to enable efficient inference on hardware with limited VRAM while largely preserving the quality of the original model.


## ⚙️ Model Specifications and Quantization

This model was quantized with the `bitsandbytes` library using the NF4 (Normal Float 4-bit) format; `bitsandbytes` is also required to load it.

**Model Configuration** (from `config.json`):

| Parameter | Value | Description |
|---|---|---|
| Architecture | `Qwen2ForCausalLM` | The model's base architecture. |
| Parameter Count | 32 billion (original) | The number of parameters in the original model. |
| Number of Layers | 64 | The number of transformer blocks. |
| Hidden Size | 5120 | The dimension of the hidden states. |
| Context Length | 32768 | The maximum context length the model can process. |
| Dtype (Activations) | `bfloat16` | The data type for activations during inference (recommended for stability). |

**Quantization Details** (`quantization_config`):

| Parameter | Value | Description |
|---|---|---|
| Method | `bitsandbytes` | The quantization library used. |
| Load In 4-bit | `true` | The model is loaded with 4-bit weights. |
| Quantization Type | `nf4` | Normal Float 4-bit, optimized for normally distributed transformer weights. |
| Compute Dtype | `bfloat16` | The dtype weights are dequantized to for computation (matrix multiplication). |
| Double Quantization | `true` | Applies a second 8-bit quantization to the scaling constants, further reducing memory usage. |
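
For reference, these settings correspond to the `BitsAndBytesConfig` sketched below. The configuration is already stored in this repository's `config.json`, so passing it explicitly is optional; the snippet only illustrates how the same quantization would be requested by hand when loading an unquantized checkpoint.

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute and double quantization,
# matching the quantization_config table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# Pass as quantization_config=bnb_config to AutoModelForCausalLM.from_pretrained(...)
```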

## 💻 Usage (Inference)

To use this quantized model, make sure `accelerate` and `bitsandbytes` are installed. You can then load the model directly with `AutoModelForCausalLM` from the Hugging Face `transformers` library; the quantization settings are picked up automatically from the repository's `config.json`.

### Required Libraries

```bash
pip install transformers accelerate bitsandbytes torch
```
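
As an optional sanity check before downloading a 32B model, you can confirm that a CUDA device is visible and that `bitsandbytes` imports cleanly. This is a minimal sketch, not required for loading:

```python
import torch
import bitsandbytes as bnb  # fails here if the 4-bit backend is not installed

print("bitsandbytes version:", bnb.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Total VRAM of the first GPU, in GiB
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total VRAM: {total_gib:.1f} GiB")
```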
### Example: Code Generation

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in 4-bit; the quantization_config saved in the repo is applied automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# 📝 Input prompt
prompt = "def quicksort(arr):"
messages = [
    {"role": "user", "content": f"Write a Python function for quicksort.\n\n{prompt}"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,  # ensures correct padding/EOS handling
)

# Keep only the newly generated tokens (drop the echoed prompt) and decode
new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(generated_text)
```
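
For interactive use, the same call can stream tokens to the console as they are generated. A minimal sketch using `transformers.TextStreamer`, reusing `model`, `tokenizer`, and `model_inputs` from the example above:

```python
from transformers import TextStreamer

# Prints tokens as they are produced, skipping the echoed prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
)
```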

## Disclaimer and Limitations

**Abliterated Model Status:** This model is based on the *abliterated* variant (`*-abliterated`), meaning that certain behaviors of the original model (typically its refusal responses) were deliberately removed or modified. The quantized version inherits these characteristics, and performance in some domains may differ from the non-abliterated base model.

**Memory Requirements:** Although the weights are 4-bit quantized, this is still a 32B-parameter model and requires a GPU with substantial VRAM (typically ~18 GB or more, depending on context length).
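
A quick way to check the real footprint on your hardware is sketched below, assuming the model has already been loaded as in the usage example:

```python
import torch

# Approximate memory taken by the quantized weights, in GiB
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")

# Peak GPU memory actually allocated so far (weights plus activations)
if torch.cuda.is_available():
    print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")
```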

**Accuracy:** Quantization to 4-bit (NF4) introduces a small loss of precision, which may reduce output quality slightly compared to the original FP16/BF16 model.

...

## 🔗 Sources and Acknowledgements

- Base model: [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated)
- Original model family: Qwen2.5-Coder by the Qwen team
- Quantization performed with `bitsandbytes` via the Hugging Face `transformers` library