Instructions to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF",
	filename="Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./llama-cli -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Use Docker

docker model run hf.co/XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

LM Studio
Jan
Ollama
How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with Ollama:
```
ollama run hf.co/XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL
```

Unsloth Studio new

How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF to start chatting

Pi new

How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Run Hermes

hermes

Docker Model Runner
How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with Docker Model Runner:
```
docker model run hf.co/XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL
```

Lemonade

How to use XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull XpressAI/Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF:UD-Q4_K_XL

Run and chat with the model

lemonade run user.Qwen3.6-27B-RYS-UD-Q4_K_XL-GGUF-UD-Q4_K_XL

List all available models

lemonade list

Qwen3.6-27B — RYS Layer Surgery (GGUF)

A modified version of Qwen3.6-27B-Instruct produced by RYS layer duplication — no training, no weight changes, just running layers 33–36 a second time during the forward pass.

Based on David Ng's RYS method.

TL;DR

On the Berkeley Function-Call Leaderboard (BFCL v4, 100 tests/category × 13 single-turn categories, sampled), this variant beats the unmodified base model by +1.96 pp on average when run with thinking mode enabled — driven by large gains on the hardest live categories:

Category	Base	rys_33-36	Δ
live_parallel	68.75%	87.50%	+18.75
live_relevance	68.75%	81.25%	+12.50
live_parallel_multiple	70.83%	75.00%	+4.17
mean (13 categories)	82.56%	84.52%	+1.96

The wins come from improved reasoning during prefill on multi-call / relevance-judgement queries. The trade is small regressions (−1 to −3 pp) on easier non-live categories. Thinking mode is required — without it, this variant slightly underperforms base.

Files

File	Layers	Size
`Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf`	68	18 GiB

The base GGUF (no surgery) is at unsloth/Qwen3.6-27B-GGUF.

Internal probe results

A small probe of math, EQ, and reasoning prompts was run during the layer search. The probe categories are tiny (3 questions per reasoning subcategory, ~16 EQ-Bench-style items, ~16 math problems) so individual numbers should be treated as directional, not definitive.

Metric	Base	rys_33-36
Math (GSM8K-style partial credit)	0.537	0.500
EQ (EQ-Bench-style, 0–100)	93.59	86.64
Reasoning total (17 probes, 5 subcategories)	0.765	0.882
↳ causal	0.67	1.00
↳ date	1.00	1.00
↳ logic	1.00	1.00
↳ navigation	0.67	1.00
↳ gsm	0.60	0.60

Layers 33–36 was the only configuration in the layer-block sweep that achieved a perfect score on the causal reasoning subcategory while keeping the other reasoning categories at or above their baseline. This is what motivated picking it for the BFCL run below.

BFCL results (sampled, thinking enabled)

Category	Base	rys_33-36
irrelevance	90.00	88.00
multiple	96.00	95.00
parallel	93.00	91.00
parallel_multiple	87.00	85.00
simple_java	59.00	61.00
simple_javascript	74.00	72.00
simple_python	95.00	92.00
live_irrelevance	98.00	99.00
live_multiple	88.00	87.00
live_parallel	68.75	87.50
live_parallel_multiple	70.83	75.00
live_relevance	68.75	81.25
live_simple	85.00	85.00
mean	82.56	84.52

Sample size: 100 tests/category for categories with ≥100 entries; the full category was used for the smaller ones (live_parallel, live_parallel_multiple, live_relevance, simple_javascript). 1006 tests per model in total. The full benchmark would be ~5x larger and would also cover multi-turn, memory, and web-search categories that we did not run.

Inference: llama.cpp llama-server --jinja, BFCL via /v1/chat/completions with native tool use, temperature=1.0, top_p=0.95, top_k=20, max_tokens=8192. Multi-turn, memory, and web-search categories were not run.

What is RYS?

Transformers self-organise during training into functional circuits — contiguous blocks of layers that act together. RYS duplicates a specific block in the forward pass using the same weights:

Normal:    0 → … → 32 → 33 → 34 → 35 → 36 → 37 → … → 63
rys_33-36: 0 → … → 32 → 33 → 34 → 35 → 36
                       → 33 → 34 → 35 → 36 → 37 → … → 63

The model processes layers 33–36 twice. No fine-tuning, no extra parameters beyond the GGUF file overhead. Total layer count goes from 64 → 68.

How the layer range was found

A two-pass sweep across all 64 layers using a small probe of math, EQ, and reasoning prompts:

Pass 1 (8-layer blocks, stride 4): identified hot zones around layers 32–48 (math gains, causal reasoning) and 48–60 (general reasoning gains).
Pass 2 (4-layer blocks, stride 1, layers 32–58): (33, 37) was the only configuration that achieved a perfect score on the probe's causal reasoning subcategory while keeping date, logic, and nav at their baseline ceilings.

The probe alone suggested rys_33-36 was a moderate win. The sampled BFCL run with thinking enabled confirms it on the harder live categories (above).

Extended evaluation (Ng's protocol)

After a thoughtful question on the discussion forum about deviations from David Ng's suggested reproduction path, we went back and ran the steps we had skipped:

Extended probe — math_120 + eq_140 from Ng's repo, --reasoning off to match the protocol's intent (the math probe is designed for intuitive guessing, not deliberate computation):

Variant	math_120	eq_140
base	0.9986	74.53
rys_33-36	0.9930	78.81

On the larger probe rys_33-36 holds its EQ improvement (+4.28 pp). Math is at ceiling for both. Note this is the opposite direction from our small internal probe (where rys_33-36 had lower EQ) — small-probe variance was misleading us; the 140-question sample is the trustworthy reading.

Depth-2 beam search — 10 non-overlapping pair-combinations of the top single-block configs, each scored on the same probe:

Variant	math_120	eq_140
rys_33-36	0.9930	78.81
rys_33-36 + 49-52	0.9226	75.66
rys_33-36 + 53-56	0.9219	75.27
rys_33-36 + 54-57	0.9639	72.21
rys_33-36 + 56-59	0.9643	74.21
rys_33-36 + 58-61	0.9930	68.78
rys_49-52 + 53-56	0.8864	66.70
rys_49-52 + 56-59	0.9654	69.67
rys_49-52 + 58-61	0.9606	69.18
rys_53-56 + 58-61	0.9635	63.57
rys_54-57 + 58-61	0.9703	59.93

No depth-2 combination beats rys_33-36 on EQ_140. Stacking blocks degrades math (sometimes catastrophically) without improving EQ. So the shortcut we took in candidate selection (no beam search) did not cost us a better configuration in this neighborhood. We did not train Ng's surrogate regressor or run a deeper beam search — those would explore more of the configuration space and might find something better.

Hybrid Mamba/attention architecture constraint

Qwen3.6-27B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Gated DeltaNet SSM everywhere else. This creates a hard constraint: the total layer count must remain divisible by 4.

Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗ → crash)

Usage

llama.cpp / llama-server

The wins require thinking mode. Use --jinja so the server applies the Qwen3.6 chat template, which primes thinking properly:

llama-server -m Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf \
             --jinja \
             -ngl 99 -c 32768 \
             --port 8080

Sampling parameters (Qwen3.6 thinking-mode defaults)

temperature = 1.0
top_p       = 0.95
top_k       = 20
min_p       = 0.0

For more deterministic / coding-focused tasks, Qwen recommends temperature=0.6 instead. Either way, leave thinking enabled.

Token budget

Qwen3.6's thinking chains can be long (we observed up to ~7k tokens of reasoning on hard BFCL parallel cases). Set max_tokens ≥ 8192 to avoid truncating mid-thought.

VRAM

About 22 GiB at Q4_K_XL with 32k context and Q8 KV cache. Fits comfortably on a single A100 40 GB.

When to use this

You want better function-calling performance on complex live queries (parallel calls, relevance judgement) and you can afford ~6 extra layers of prefill compute.
You're running with thinking mode on (this is where the gain comes from).

When NOT to use this

You're running without thinking — base will be ~1.5 pp better.
You care about the very-easy categories (simple_python, multiple) more than the hard live ones — base is 1–3 pp better there.

Credits

David Ng for the original RYS method
Unsloth for the base UD-Q4_K_XL quantization
Qwen team for Qwen3.6-27B
llama.cpp for local inference
The Berkeley Function-Call Leaderboard for the eval harness

License

Apache 2.0 (inherited from base model).

Downloads last month: 1,344

GGUF

Model size

28B params

Architecture

qwen35

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support