Instructions for using ikarius/Qwen3-8B-Abliterated-FP8 with libraries and local apps.

Libraries

How to use ikarius/Qwen3-8B-Abliterated-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ikarius/Qwen3-8B-Abliterated-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ikarius/Qwen3-8B-Abliterated-FP8")
model = AutoModelForCausalLM.from_pretrained("ikarius/Qwen3-8B-Abliterated-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ikarius/Qwen3-8B-Abliterated-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ikarius/Qwen3-8B-Abliterated-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ikarius/Qwen3-8B-Abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
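
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion JSON. The helper names build_chat_request and chat are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

MODEL_ID = "ikarius/Qwen3-8B-Abliterated-FP8"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI-compatible response layout: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("What is the capital of France?") should return the reply text. Passing base_url="http://localhost:30000" targets an SGLang server on its default port from this page instead.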

Use Docker

docker model run hf.co/ikarius/Qwen3-8B-Abliterated-FP8

SGLang

How to use ikarius/Qwen3-8B-Abliterated-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ikarius/Qwen3-8B-Abliterated-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ikarius/Qwen3-8B-Abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ikarius/Qwen3-8B-Abliterated-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ikarius/Qwen3-8B-Abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ikarius/Qwen3-8B-Abliterated-FP8 with Docker Model Runner:
```
docker model run hf.co/ikarius/Qwen3-8B-Abliterated-FP8
```

Qwen3-8B-FineGrained-FP8 (Blackwell Optimized)

This repository contains a high-precision Fine-Grained FP8 quantization of huihui-ai/Qwen3-8B-Instruct-Abliterated.

The quantization parameters were chosen for the NVIDIA Blackwell (RTX 50-series) architecture.

Model Highlights

Architecture: Qwen3-8B
Quantization: Fine-Grained FP8
Optimization: Targets Blackwell Tensor Cores (weight_block_size=(128, 128))
Abliterated: Based on the version by huihui-ai, where refusal mechanisms have been removed to provide more direct, unfiltered responses.

Technical Configuration

The quantization was performed using FineGrainedFP8Config with the following settings:

Weight Block Size: 128x128. This specific block size is designed to align with the hardware throughput of RTX 5090 and other Blackwell-based GPUs, allowing for native execution with minimal overhead.
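Precision: Standard per-tensor FP8 uses a single scale factor for a whole weight matrix. The fine-grained approach gives each 128x128 block its own scale factor, so an outlier in one block does not coarsen the quantization of the others, which preserves output quality better.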
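
To illustrate why per-block scaling helps, here is a toy sketch in plain Python. It is not the code used to produce this checkpoint: it uses a 4x4 matrix with 2x2 blocks instead of 128x128, a simplified model of E4M3 rounding (maximum 448, 3 mantissa bits), and made-up weight values containing a single outlier. All function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value (simplified: no NaN handling)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    _, exp = math.frexp(mag)          # mag = m * 2**exp with 0.5 <= m < 1
    exponent = max(exp - 1, -6)       # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)      # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def block_scales(w, block):
    """One scale per (block x block) tile, chosen so the tile's max maps to 448."""
    rows, cols = len(w), len(w[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(w[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


def total_abs_error(w, scales, block):
    """Sum of |original - dequantized| over the whole matrix."""
    err = 0.0
    for r, row in enumerate(w):
        for c, value in enumerate(row):
            scale = scales[r // block][c // block]
            restored = round_to_e4m3(value / scale) * scale
            err += abs(value - restored)
    return err


# Toy weights: small values everywhere, one large outlier in the top-left tile.
weights = [[0.001] * 4 for _ in range(4)]
weights[0][0] = 448.0

per_tensor = block_scales(weights, 4)  # a single tile covers the whole matrix
per_block = block_scales(weights, 2)   # four 2x2 tiles, four scales

err_tensor = total_abs_error(weights, per_tensor, 4)
err_block = total_abs_error(weights, per_block, 2)
print(f"per-tensor error: {err_tensor:.6f}")
print(f"per-block error:  {err_block:.6f}")
```

With a single scale, the outlier forces every small weight onto a coarse grid. With per-block scales, only the tile that contains the outlier pays that cost.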

Hardware Requirements

Optimal: NVIDIA RTX 50-series (Blackwell) for native hardware acceleration.
Supported: NVIDIA RTX 40-series (Ada Lovelace), H100, and L40S.
VRAM: The weights occupy approximately 8-9 GB of VRAM. A card with 12 GB or more is recommended to leave room for the KV cache at longer context lengths.
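
The VRAM figures can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below assumes 8B parameters at one byte each, a BF16 KV cache, and a Qwen3-8B attention layout of 36 layers, 8 key/value heads, and a head dimension of 128. These layout values are assumptions and should be checked against the repository's config.json. Activations and framework overhead are ignored.

```python
# Assumed Qwen3-8B attention geometry; verify against config.json.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2  # assumes the KV cache is kept in BF16

GIB = 1024 ** 3


def weight_bytes(num_params, bytes_per_param=1.0):
    """Approximate weight memory: FP8 stores one byte per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_tokens):
    """Keys and values (factor 2) for every layer, KV head, and head dimension."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE
    return per_token * num_tokens


weights_gib = weight_bytes(8e9) / GIB
for context in (4096, 32768):
    cache_gib = kv_cache_bytes(context) / GIB
    print(f"{context:>6} tokens: weights {weights_gib:.2f} GiB + KV cache {cache_gib:.2f} GiB")
```

Under these assumptions the cache grows by 144 KiB per token, so a long context adds several GiB on top of the weights.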

Usage

You can load this model directly using the transformers library. Ensure you have the latest version of accelerate and transformers installed.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id of this model on the Hugging Face Hub
model_id = "ikarius/Qwen3-8B-Abliterated-FP8"

# device_map="auto" places the weights on the available GPU(s);
# dtype="auto" keeps the dtypes stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain the advantages of FP8 quantization for LLMs."
# Move the inputs to the device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Process

The model was quantized from the BF16 source in three steps:

The BF16 checkpoint was loaded with dtype="auto" and device_map="auto".

Quantization was configured with FineGrainedFP8Config(weight_block_size=(128, 128)).

The weights were saved in FP8 format so that the model loads immediately without re-quantization.
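
The steps above can be sketched roughly as follows. This is a reconstruction from the description, not the author's actual script. The source id is the one named in this card, the output directory is a hypothetical local path, and the dtype argument follows the card (older transformers releases call it torch_dtype). Running it needs a GPU supported by the FP8 quantizer.

```python
SOURCE_MODEL = "huihui-ai/Qwen3-8B-Instruct-Abliterated"  # source id as named in this card
OUTPUT_DIR = "Qwen3-8B-Abliterated-FP8"                   # hypothetical local output path
WEIGHT_BLOCK_SIZE = (128, 128)


def quantize_to_fp8(source_model=SOURCE_MODEL, output_dir=OUTPUT_DIR):
    """Load the BF16 checkpoint, quantize it on load, and save the FP8 weights."""
    # Imported lazily so the heavy GPU stack loads only when quantization runs.
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        FineGrainedFP8Config,
    )

    # Step 2 of the description: block-wise FP8 with 128x128 weight blocks.
    config = FineGrainedFP8Config(weight_block_size=WEIGHT_BLOCK_SIZE)

    # Step 1: load the source checkpoint; quantization is applied during loading.
    model = AutoModelForCausalLM.from_pretrained(
        source_model,
        dtype="auto",
        device_map="auto",
        quantization_config=config,
    )
    tokenizer = AutoTokenizer.from_pretrained(source_model)

    # Step 3: save the quantized weights so later loads skip re-quantization.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

Calling quantize_to_fp8() downloads the source checkpoint and writes the quantized model and tokenizer to the output directory.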

Disclaimer

This is an abliterated model. It has fewer safety guardrails than the original Qwen3 release. Users are responsible for implementing their own moderation layers and for using the model ethically and legally.

Credits

Original Model: Qwen Team

Abliteration: huihui-ai

Safetensors

Model size: 8B params

Tensor type: BF16, F8_E4M3