Qwen3-8B-FineGrained-FP8 (Blackwell Optimized)
This repository contains a high-precision Fine-Grained FP8 quantization of huihui-ai/Qwen3-8B-Instruct-Abliterated.
The quantization parameters were chosen for the NVIDIA Blackwell (RTX 50-series) architecture.
Model Highlights
- Architecture: Qwen3-8B
- Quantization: Fine-Grained FP8
- Optimization: Optimized for Blackwell Tensor Cores (weight_block_size=(128, 128))
- Abliterated: Based on the version by huihui-ai, where refusal mechanisms have been removed to provide more direct, unfiltered responses.
Technical Configuration
The quantization was performed using FineGrainedFP8Config with the following settings:
- Weight Block Size: 128x128. This specific block size is designed to align with the hardware throughput of RTX 5090 and other Blackwell-based GPUs, allowing for native execution with minimal overhead.
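- Precision: Standard per-tensor FP8 uses a single scale for an entire weight matrix, so one outlier reduces the precision of every weight in it. The fine-grained approach stores a separate scale for each 128x128 block, which confines an outlier's effect to its own block and preserves higher output quality.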
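To make the difference between per-tensor and per-block scaling concrete, here is a minimal pure-Python sketch. It only computes the scales and does not perform real FP8 rounding. It assumes the E4M3 variant of FP8 (largest finite value 448); the helper name `block_scales` and the toy matrix are illustrative and are not taken from the quantization code used for this repository.

```python
import random

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)

def block_scales(weight, block=(128, 128)):
    """Return one scale per (block[0] x block[1]) tile of a 2-D weight matrix."""
    rows, cols = len(weight), len(weight[0])
    br, bc = block
    scales = []
    for r0 in range(0, rows, br):
        row_scales = []
        for c0 in range(0, cols, bc):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + br, rows))
                for c in range(c0, min(c0 + bc, cols))
            )
            # Map the tile's largest magnitude onto the FP8 maximum.
            row_scales.append(amax / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales

random.seed(0)
# Toy 256x256 matrix of small weights with a single large outlier.
w = [[random.uniform(-0.05, 0.05) for _ in range(256)] for _ in range(256)]
w[0][0] = 8.0

per_block = block_scales(w)                     # 2 x 2 grid of scales
per_tensor = block_scales(w, block=(256, 256))  # a single scale for everything

print("per-tensor scale:", per_tensor[0][0])
print("per-block scales:", per_block)
```

In this toy case the outlier inflates the single per-tensor scale, while with 128x128 blocks only the top-left tile gets the large scale and the other three tiles keep scales matched to their small weights.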
Hardware Requirements
- Optimal: NVIDIA RTX 50-series (Blackwell) for native hardware acceleration.
- Supported: NVIDIA RTX 40-series (Ada Lovelace), H100, and L40S.
- VRAM: The model occupies approximately 8-9 GB. A card with 12 GB or more is recommended to leave room for the KV cache at longer context lengths.
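The extra memory needed for the KV cache can be estimated with simple arithmetic. The sketch below assumes the Qwen3-8B attention shape (36 layers, 8 key/value heads, head dimension 128) and a 2-byte (BF16) cache. These values are assumptions to check against the repository's config.json, and the function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Qwen3-8B attention shape; verify against the repository's config.json.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 36, 8, 128

per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
for context_len in (8_192, 32_768):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>6} tokens -> {gib:.3f} GiB of KV cache")
```

Under these assumptions an 8,192-token context adds roughly 1.1 GiB on top of the weights. The cache grows linearly with context length, which is why headroom beyond the 8-9 GB weight footprint matters.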
Usage
You can load this model directly using the transformers library. Ensure you have the latest version of accelerate and transformers installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen3-8B-FineGrained-FP8"

# Load the pre-quantized FP8 weights; device_map="auto" places them on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain the advantages of FP8 quantization for LLMs."
# Move the inputs to whichever device the model was placed on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
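Because this is a chat-tuned model, conversational prompts usually behave better when formatted with the tokenizer's chat template than when passed as raw text. The helper below is a sketch of that pattern. The name `chat_generate` is hypothetical, and it expects a `model` and `tokenizer` that have already been loaded with transformers.

```python
def chat_generate(model, tokenizer, user_text, max_new_tokens=256):
    """Format a single user turn with the chat template and return only the reply."""
    messages = [{"role": "user", "content": user_text}]

    # The chat template adds the role markers the model was trained with.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```

For example, `print(chat_generate(model, tokenizer, "Who are you?"))` prints just the assistant's answer rather than the prompt followed by the answer.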
Quantization Process
The model was quantized from the BF16 source in three steps:
- Loaded with dtype="auto" and device_map="auto".
- Configured with FineGrainedFP8Config(weight_block_size=(128, 128)).
- Saved in FP8 format, so the weights load directly without re-quantization.
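The steps above can be sketched with the transformers quantization API roughly as follows. This is a reconstruction from the description, not the exact script used for this repository. The function name `quantize_and_save` and its arguments are hypothetical, and running it requires a GPU and recent transformers and accelerate releases.

```python
def quantize_and_save(source_id, output_dir):
    """Quantize a BF16 checkpoint to fine-grained FP8 and save the result."""
    # Imported inside the function so the definition itself needs no GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer, FineGrainedFP8Config

    # One scale per 128x128 weight tile.
    quant_config = FineGrainedFP8Config(weight_block_size=(128, 128))

    # Weights are quantized while the source checkpoint loads.
    model = AutoModelForCausalLM.from_pretrained(
        source_id,
        dtype="auto",
        device_map="auto",
        quantization_config=quant_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(source_id)

    # Saving stores the FP8 weights and their scales, so later loads skip re-quantization.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

A call might look like `quantize_and_save("huihui-ai/Qwen3-8B-Instruct-Abliterated", "./Qwen3-8B-FineGrained-FP8")`, where the output directory name is only an example.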
Disclaimer
This is an abliterated model. It has fewer safety guardrails compared to the original Qwen3 release. Users are responsible for their own implementations of moderation layers and for using the model ethically and legally.
Credits
Original Model: Qwen Team
Abliteration: huihui-ai