ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-235B-A22B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw. These first two are just test quants for baseline perplexity comparison:

Perplexity Chart

  • bf16 437.989 GiB (16.003 BPW)
    • Final estimate: PPL = 4.1898 +/- 0.02367
  • Q8_0 232.769 GiB (8.505 BPW)
    • Final estimate: PPL = 4.1956 +/- 0.02371
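
The size and BPW figures in this collection are tied together by simple arithmetic. As a rough sanity check, and assuming BPW means total file bits divided by parameter count (that definition is my assumption, it is not spelled out here), you can back out the implied parameter count from any entry. The helper name `implied_params_b` is made up for this sketch:

```bash
#!/usr/bin/env bash
# Sketch: back out the implied parameter count (in billions) from a
# quant's size in GiB and its bits-per-weight (BPW).
# Assumes BPW = (file size in bits) / (number of parameters).
implied_params_b() {
  local gib=$1 bpw=$2
  awk -v gib="$gib" -v bpw="$bpw" \
    'BEGIN { printf "%.1f\n", gib * 1073741824 * 8 / bpw / 1e9 }'
}

implied_params_b 437.989 16.003   # bf16 baseline
implied_params_b 232.769 8.505    # Q8_0 baseline
```

Both baselines should come out close to 235 billion, which lines up with the model name. The same relation run in reverse gives a quick estimate of how much RAM plus VRAM a given BPW will need.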

IQ5_K 161.722 GiB (5.909 BPW)

Final estimate: PPL = 4.2213 +/- 0.02391

👈 Secret Recipe
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ5_K.gguf \
    IQ5_K \
    192
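
Every recipe here uses the same `grep` and `sed` step to turn the readable multi-line rule list into the single comma-separated string passed to `--custom-q`. Here is a minimal sketch of just that step, using a shortened two-rule list picked for illustration. It assumes GNU sed, since `-z` is a GNU extension, and the variable name `flat` is mine:

```bash
#!/usr/bin/env bash
# Shortened rule list, same layout as the recipes: comments and blank
# lines are allowed for readability.
custom="
# Attention
blk\..*\.attn_k.*=q8_0

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
"

# 1. grep -v '^#' drops the comment lines.
# 2. sed -z reads the whole input as one record, so runs of newlines
#    (including blank lines) can be collapsed into single commas;
#    the last two expressions trim a trailing and a leading comma.
flat=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$flat"
```

Echoing the flattened string before launching a multi-hour quantize run is a cheap way to catch a typo in a rule.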

IQ4_K 134.183 GiB (4.903 BPW)

Final estimate: PPL = 4.2407 +/- 0.02406

👈 Secret Recipe
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ4_K.gguf \
    IQ4_K \
    192

IQ4_KSS 114.093 GiB (4.169 BPW)

Final estimate: PPL = 4.2799 +/- 0.02423

This one is a little funky just for fun.

👈 Secret Recipe
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq5_ks

# Routed Experts
blk\.(0|1|2)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ4_KSS.gguf \
    IQ4_KSS \
    192
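
In this recipe the layer-specific rules for `blk.0` through `blk.2` are listed before the catch-all `blk\..*` rules, and the order looks deliberate. The sketch below shows why it would matter, assuming the quantizer applies the first rule whose regex matches a tensor name. That first-match behavior is my assumption and worth checking against the ik_llama.cpp source. `resolve_type` is a hypothetical helper, and bash's `=~` stands in for the real matcher:

```bash
#!/usr/bin/env bash
# Two rules from the recipe above, already flattened to a comma list.
rules='blk\.(0|1|2)\.ffn_down_exps\.weight=iq5_ks,blk\..*\.ffn_down_exps\.weight=iq4_ks'

# Hypothetical helper: print the type of the FIRST rule whose regex
# matches the tensor name (assumed first-match-wins semantics).
resolve_type() {
  local name=$1 rule pattern
  local -a list
  IFS=, read -r -a list <<< "$rules"
  for rule in "${list[@]}"; do
    pattern=${rule%=*}          # everything before the last '='
    if [[ $name =~ $pattern ]]; then
      echo "${rule##*=}"        # everything after the last '='
      return 0
    fi
  done
  echo "default"
}

resolve_type blk.1.ffn_down_exps.weight    # early layer, specific rule
resolve_type blk.40.ffn_down_exps.weight   # later layer, catch-all rule
```

Under that assumption, swapping the two rules would let the catch-all match first and the early layers would never get the larger type.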

IQ3_K 106.644 GiB (3.897 BPW)

Final estimate: PPL = 4.3319 +/- 0.02470

👈 Secret Recipe
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ3_K.gguf \
    IQ3_K \
    192

IQ3_KS 101.308 GiB (3.702 BPW)

Final estimate: PPL = 4.3718 +/- 0.02509

👈 Secret Recipe
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\.(0|1|2|3)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2|3)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ3_KS.gguf \
    IQ3_KS \
    192

IQ2_KL 81.866 GiB (2.991 BPW)

Final estimate: PPL = 4.6608 +/- 0.02720

👈 Secret Recipe
#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ2_KL.gguf \
    IQ2_KL \
    192

Quick Start

This example is for hybrid inferencing on a single CUDA GPU with CPU/RAM. Check the ik_llama.cpp discussions or my other quants for more examples covering multi-GPU setups, etc.

./build/bin/llama-server \
  --model /models/IQ5_K/Qwen3-235B-A22B-Thinking-IQ5_K-00001-of-00004.gguf \
  --alias ubergarm/Qwen3-235B-A22B-Thinking-2507 \
  -fa -fmoe \
  -ctk q8_0 -ctv q8_0 \
  -c 32768 \
  -ngl 99 \
  -ot "blk\.[0-9]\.ffn.*=CUDA0" \
  -ot "blk.*\.ffn.*=CPU" \
  --threads 16 \
  -ub 4096 -b 4096 \
  --host 127.0.0.1 \
  --port 8080
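
The `-ot` patterns are regexes over tensor names, so it helps to check how many layers a pattern really covers before tuning it for your VRAM. The sketch below counts how many of the 94 repeating layers (0-93, per the recipes above) the CUDA0 pattern matches. It uses bash's `=~` as a stand-in for the matcher in ik_llama.cpp, which is an assumption, and it tests only one representative FFN tensor name per layer:

```bash
#!/usr/bin/env bash
# The CUDA0 pattern from the Quick Start command, without the "=CUDA0" part.
gpu_pattern='blk\.[0-9]\.ffn.*'

gpu_layers=0
for i in $(seq 0 93); do
  # One representative FFN tensor name per layer.
  name="blk.${i}.ffn_down_exps.weight"
  if [[ $name =~ $gpu_pattern ]]; then
    gpu_layers=$((gpu_layers + 1))
  fi
done

echo "FFN tensors of ${gpu_layers} of 94 layers match the CUDA0 pattern"
```

Because `[0-9]` matches exactly one digit, only layers 0 through 9 match. To put more layers on the GPU you would widen the pattern, for example `blk\.(1?[0-9])\.ffn.*` for layers 0 through 19, and rerun the count.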
