Qwopus-GLM-9B-DualReason-Distilled
A DARE-TIES weight merge of two specialized Qwen3.5-9B fine-tunes, combining Opus-style agentic reasoning with GLM-5.1 structured chain-of-thought into a single 9B model.
This model is the result of a research process that first explored ANN-based layer routing and then settled on DARE-TIES as the merging strategy. The methodology and findings are documented below.
Benchmark Results
⏳ Evaluation in progress — Results will be updated shortly.
| Benchmark | Score |
|---|---|
| HumanEval pass@1 | ⏳ |
| MMLU-Pro | ⏳ |
| GSM8K | ⏳ |
| ARC-Challenge | ⏳ |
Model Details
| Property | Value |
|---|---|
| Base architecture | Qwen3.5-9B |
| Parameters | ~9B |
| Merge method | DARE-TIES (weight=0.5, density=0.5) |
| Context length | 262,144 tokens |
| License | Apache 2.0 |
| Training hardware | H100 80GB (Vast.ai) |
Available Quantizations
| Quantization | Size | Use case |
|---|---|---|
| F16 | 17.9 GB | Unquantized 16-bit / re-quantization |
| Q8_0 | 9.5 GB | Near-lossless |
| Q6_K | 7.4 GB | High quality |
| Q5_K_M | 6.5 GB | Recommended for quality |
| Q4_K_M | 5.6 GB | Best balance ← start here |
| Q4_0 | 5.3 GB | Fast inference |
| IQ4_XS | 5.2 GB | Efficient 4-bit |
| Q3_K_S | 4.3 GB | Small footprint |
| IQ3_M | 4.4 GB | Small + imatrix |
| Q2_K | 3.8 GB | Smallest file, lowest quality |
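As a rough aid for choosing a file from the table above, the following sketch picks the largest quantization that fits a given memory budget. The sizes are copied from the table; the `pick_quantization` helper and its `headroom_gb` allowance for KV cache and runtime buffers are hypothetical, and the real overhead depends on context size and backend.

```python
# File sizes in GB, copied from the quantization table above.
QUANT_SIZES_GB = {
    "F16": 17.9, "Q8_0": 9.5, "Q6_K": 7.4, "Q5_K_M": 6.5, "Q4_K_M": 5.6,
    "Q4_0": 5.3, "IQ4_XS": 5.2, "IQ3_M": 4.4, "Q3_K_S": 4.3, "Q2_K": 3.8,
}


def pick_quantization(memory_gb, headroom_gb=2.0):
    """Return the largest quantization whose file fits in memory_gb - headroom_gb.

    headroom_gb is an assumed allowance for KV cache and runtime buffers,
    not a measured value. Returns None when nothing fits.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(pick_quantization(8.0))   # Q4_K_M
print(pick_quantization(24.0))  # F16
```

Long contexts need considerably more headroom than the default used here, so treat the result as a starting point rather than a guarantee.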
Usage
With llama.cpp (recommended)
llama-server \
-m Qwopus-GLM-9B-DualReason-Distilled-Q4_K_M.gguf \
--ctx-size 32768 \
--flash-attn on \
--n-gpu-layers 99
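Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on llama-server's default address, `http://localhost:8080`; the helper names `build_chat_request` and `chat` and the `max_tokens` value are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,  # illustrative limit, adjust as needed
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8080"):
    """Send the prompt to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires a running llama-server:
# print(chat("Solve step by step: 2x + 5 = 13"))
```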
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged model in bfloat16 on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "rico03/Qwopus-GLM-9B-DualReason-Distilled",
    dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "rico03/Qwopus-GLM-9B-DualReason-Distilled",
    trust_remote_code=True,
)

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Solve step by step: 2x + 5 = 13"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Greedy decoding; the decoded text includes the prompt
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Source Models
All credit for the source models goes to Jackrong.
Jackrong/Qwopus3.5-9B-v3.5
A reasoning-enhanced fine-tune of Qwen3.5-9B trained with ~2x more SFT data than v3, focused on structured reasoning, tool-augmented workflows, and multi-step agentic tasks.
- HumanEval pass@1: 87.80%
- MMLU-Pro: 90.36%
- Training guide: GitHub
Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1
A distilled variant of Qwen3.5-9B trained on high-quality reasoning data from a GLM-5.1 teacher model (~700x scale). Focused on structured reasoning, instruction-following, and problem decomposition.
- Training data: Jackrong/GLM-5.1-Reasoning-1M-Cleaned
Why DARE-TIES and Not a Frankenmerge
This model went through an extensive research process before arriving at DARE-TIES.
Step 1 — ANN Router (abandoned)
We designed a lightweight ANN router (~1M parameters) to learn optimal layer selection between the two models:
Input hidden states (both models)
↓
LayerRouter MLP: Linear(8320,1024) → SiLU → LayerNorm → Dropout → Linear(1024,256) → SiLU → LayerNorm → Linear(256,1)
↓
α ∈ [0,1] per layer → mixed = α × h_qwopus + (1-α) × h_glm
Trained on 12,297 examples from GSM8K, MMLU, ARC-Challenge, HumanEval, and IFEval with entropy regularization. Result: the router consistently collapsed to always selecting one model, under both training objectives we tried (entropy loss and cross-entropy).
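The router described above can be sketched in PyTorch as follows. The layer sizes come from the diagram; the final sigmoid, the dropout rate, the way the 8320-dimensional input is assembled from the two models' hidden states, and the `binary_entropy` diagnostic are assumptions, since the source does not specify them.

```python
import torch
import torch.nn as nn


class LayerRouter(nn.Module):
    """MLP with the layer sizes from the diagram; outputs one mixing weight per example."""

    def __init__(self, in_dim=8320, dropout=0.1):  # dropout rate is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.SiLU(), nn.LayerNorm(1024), nn.Dropout(dropout),
            nn.Linear(1024, 256), nn.SiLU(), nn.LayerNorm(256),
            nn.Linear(256, 1),
        )

    def forward(self, features):
        # Sigmoid is an assumption: the source only states that alpha lies in [0, 1].
        return torch.sigmoid(self.net(features))


def mix_hidden_states(alpha, h_qwopus, h_glm):
    """mixed = alpha * h_qwopus + (1 - alpha) * h_glm, broadcast over the hidden dimension."""
    return alpha * h_qwopus + (1.0 - alpha) * h_glm


def binary_entropy(alpha, eps=1e-6):
    """Mean entropy of the routing weight; close to zero when alpha saturates at 0 or 1."""
    a = alpha.clamp(eps, 1.0 - eps)
    return -(a * a.log() + (1.0 - a) * (1.0 - a).log()).mean()


router = LayerRouter().eval()
features = torch.randn(4, 8320)                            # placeholder router inputs
h_qwopus, h_glm = torch.randn(4, 16), torch.randn(4, 16)   # toy hidden size for illustration
with torch.no_grad():
    alpha = router(features)                               # shape (4, 1)
    mixed = mix_hidden_states(alpha, h_qwopus, h_glm)      # shape (4, 16)
print(alpha.shape, mixed.shape, binary_entropy(alpha).item())
```

A collapsed router shows up as `alpha` pinned near 0 or 1 and a `binary_entropy` value near zero. With the layer sizes exactly as listed, this module has about 8.8M parameters, so the ~1M figure quoted above may refer to a different configuration.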
Step 2 — Cosine Similarity Analysis
To understand why, we computed cosine similarity between hidden states of both models across all 32 layers on 12,297 examples:
| Layer | CosSim | Layer | CosSim | Layer | CosSim | Layer | CosSim |
|---|---|---|---|---|---|---|---|
| 00 | 1.000 | 08 | 0.994 | 16 | 0.994 | 24 | 0.996 |
| 01 | 1.000 | 09 | 0.996 | 17 | 0.994 | 25 | 0.996 |
| 02 | 0.999 | 10 | 0.996 | 18 | 0.996 | 26 | 0.996 |
| 03 | 0.999 | 11 | 0.997 | 19 | 0.996 | 27 | 0.995 |
| 04 | 0.998 | 12 | 0.996 | 20 | 0.996 | 28 | 0.994 |
| 05 | 0.998 | 13 | 0.996 | 21 | 0.996 | 29 | 0.993 |
| 06 | 0.996 | 14 | 0.994 | 22 | 0.997 | 30 | 0.993 |
| 07 | 0.995 | 15 | 0.993 | 23 | 0.996 | 31 | 0.992 |
Conclusion: Cosine similarity of 0.992–1.000 across all layers means the two models produce nearly identical hidden states. They differ in their weights, not in the structure of their internal representations. Layer-by-layer selection adds no value — the optimal combination must happen at the weight level.
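The per-layer comparison can be illustrated with a small standard-library sketch. It assumes one hidden-state vector per example and per layer has already been extracted from each model (how the states were pooled over tokens is not stated in the source), and the helper names and toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layerwise_mean_cosine(states_a, states_b):
    """states_x[layer][example] is a hidden-state vector; returns one mean similarity per layer."""
    means = []
    for layer_a, layer_b in zip(states_a, states_b):
        sims = [cosine_similarity(u, v) for u, v in zip(layer_a, layer_b)]
        means.append(sum(sims) / len(sims))
    return means


# Toy data: 2 layers, 2 examples, hidden size 3.
states_a = [[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [[1.0, 1.0, 0.0], [0.0, 0.0, 3.0]]]
states_b = [[[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]]
print([round(s, 3) for s in layerwise_mean_cosine(states_a, states_b)])  # [1.0, 0.854]
```

Cosine similarity ignores vector magnitude, which is why the first toy layer scores 1.0 even though the vectors differ in length.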
Step 3 — DARE-TIES (adopted)
DARE-TIES operates directly on model weights, interpolating the specialized knowledge of both models continuously:
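import torch

weights_qwopus = torch.randn(4, 4)   # placeholder for one Qwopus parameter tensor
weights_glm = torch.randn(4, 4)      # placeholder for the matching GLM-distill tensor

tv_g = weights_glm - weights_qwopus          # task vector: GLM relative to Qwopus
mask = torch.rand_like(tv_g) < 0.5           # DARE dropout: keep each entry with probability 0.5 (density)
tv_g = tv_g * mask / 0.5                     # rescale kept entries by 1/density
sign = tv_g.sign()                           # TIES sign election; trivial with a single task vector
tv_g = tv_g * (tv_g.sign() == sign).float()  # keeps every entry here, since the signs match by construction
merged = weights_qwopus + 0.5 * tv_g         # final merge (weight = 0.5)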
Unlike Jackrong's frankenmerge, which makes a binary choice per layer, DARE-TIES produces a continuous interpolation of both models' knowledge at every weight.
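For comparison, here is a minimal sketch of the generic DARE-TIES formulation, in which every fine-tune contributes a task vector relative to a shared base and the TIES sign election resolves conflicts between them. This is not the exact procedure used for this model, which takes Qwopus as the base and merges a single task vector. The function names are hypothetical, plain Python lists stand in for weight tensors, and the merged delta is not normalized by the summed weights, although some implementations offer that option.

```python
import random


def dare_drop(delta, density, rng):
    """DARE: keep each entry with probability `density` and rescale kept entries by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def dare_ties_merge(base, finetuned, weights, density, seed=0):
    """Merge several fine-tunes of `base` with DARE sparsification and TIES sign election."""
    rng = random.Random(seed)
    # Weighted, sparsified task vectors relative to the shared base.
    deltas = []
    for model, weight in zip(finetuned, weights):
        delta = [m - b for m, b in zip(model, base)]
        deltas.append([weight * d for d in dare_drop(delta, density, rng)])
    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in deltas]
        elected = sign(sum(column))  # TIES: the sign with the larger total mass wins
        merged.append(b + sum(d for d in column if sign(d) == elected))
    return merged


base = [0.0, 0.0, 0.0]
model_a = [1.0, -1.0, 2.0]
model_b = [1.0, 3.0, -1.0]
# density=1.0 disables the random dropout so the result is deterministic.
print(dare_ties_merge(base, [model_a, model_b], weights=[0.5, 0.5], density=1.0))
# [1.0, 1.5, 1.0]
```

In the second and third positions the two task vectors disagree in sign, so only the contribution matching the elected sign survives. With a single task vector there is nothing to disagree with, and the election step keeps every entry.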
Why No DPO/SFT
We explored DPO post-training but abandoned it after careful analysis. The primary dataset (GLM-5.1-Reasoning-1M-Cleaned) has a median chosen length of 3,225 tokens. With max_length=512, 95.8% of examples get truncated — the chosen response (with full thinking) gets cut short while the rejected (without thinking) remains intact, inverting the DPO signal. This was confirmed empirically: loss dropped from 0.51 to 0.069 in 20 steps (false convergence from corrupted signal).
Correct DPO for this class of model requires max_length ≥ 4,096 and complete long-CoT pairs — addressed in a future version.
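The truncation asymmetry can be checked before training with a few lines of Python. The sketch below counts how often the chosen and rejected responses exceed a token limit; the `truncation_report` helper and the token lengths are hypothetical, and it simplifies by comparing response lengths alone, whereas DPO trainers typically apply the limit to prompt plus response.

```python
def truncation_report(pairs, max_length):
    """pairs: list of (chosen_len, rejected_len) token counts for each preference pair."""
    n = len(pairs)
    chosen_cut = sum(1 for chosen, _ in pairs if chosen > max_length)
    rejected_cut = sum(1 for _, rejected in pairs if rejected > max_length)
    # Pairs where only the chosen response loses content: the problematic case.
    only_chosen = sum(1 for chosen, rejected in pairs if chosen > max_length >= rejected)
    return {
        "chosen_truncated": chosen_cut / n,
        "rejected_truncated": rejected_cut / n,
        "only_chosen_truncated": only_chosen / n,
    }


# Hypothetical token lengths, for illustration only.
pairs = [(3225, 180), (4100, 240), (900, 300), (450, 120)]
for max_length in (512, 4096):
    print(max_length, truncation_report(pairs, max_length))
```

A high `only_chosen_truncated` rate at the planned `max_length` is a sign that the preference signal is likely to be distorted in the way described above.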
References
- Rethinking Generalization in Reasoning SFT (Ren et al., 2026), arXiv:2604.06628
- DARE: Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (Yu et al., 2023)
- TIES: TIES-Merging: Resolving Interference When Merging Models (Yadav et al., 2023)
- Jackrong's fine-tuning guide (GitHub)
Acknowledgements
- Jackrong — both source models, training pipelines, datasets, and documentation
- Qwen team — Qwen3.5-9B base model
- GLM-5.1 team — teacher model used in distillation
- Kassadin88 — original GLM-5.1-1000000x dataset
- KyleHessling1 — Qwopus-GLM-18B-Merged reference benchmark
Citation
@misc{rico03_qwopus_glm_dualreason,
title = {Qwopus-GLM-9B-DualReason-Distilled},
author = {rico03},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/rico03/Qwopus-GLM-9B-DualReason-Distilled}
}