How to use majentik/Qwen3.5-27B-RotorQuant-MLX-4bit with MLX: download the model from the Hub with the Hugging Face CLI, then load it with mlx-lm (see Quickstart below).

```bash
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir Qwen3.5-27B-RotorQuant-MLX-4bit majentik/Qwen3.5-27B-RotorQuant-MLX-4bit
```
Qwen3.5-27B-RotorQuant-MLX-4bit
MLX 4-bit weight-quantized variant of Qwen/Qwen3.5-27B with RotorQuant KV cache compression for efficient inference on Apple Silicon.
Overview
This model combines two complementary compression techniques:
- MLX 4-bit weight quantization (affine, group size 64): reduces model size from ~54 GB to ~15 GB (see the sketch after this list)
- RotorQuant KV cache compression: compresses key-value caches during inference using Clifford algebra block-diagonal rotations, enabling longer contexts in less unified memory
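For intuition on the weight side, affine quantization with group size 64 stores a 4-bit integer code per weight plus one scale and bias per group of 64. A minimal NumPy sketch of the round trip (illustrative only, not MLX's actual kernels):

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Per-group affine quantization: w ≈ scale * q + bias."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((groups - lo) / scale).astype(np.uint8)  # 4-bit codes
    return q, scale, lo  # here the bias is the group minimum

def affine_dequantize(q, scale, bias, shape):
    return (q * scale + bias).reshape(shape)

w = np.random.randn(1024).astype(np.float32)
q, scale, bias = affine_quantize(w)
w_hat = affine_dequantize(q, scale, bias, w.shape)
print(np.abs(w - w_hat).max())  # small per-group reconstruction error
```

Stored this way, each weight costs 4 bits plus the amortized scale and bias (two floats per 64 weights), which is why the quantized model lands near ~15 GB rather than the bare 27e9 × 4 bits ≈ 13.5 GB.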
Quickstart
```python
from mlx_lm import load, generate
from turboquant import IsoQuantCache  # compressed KV cache, wired in below

model, tokenizer = load("majentik/Qwen3.5-27B-RotorQuant-MLX-4bit")

# Standard generation
prompt = "Explain the theory of relativity"
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
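The quickstart above never actually passes IsoQuantCache into generation. A minimal sketch of how it could be wired in through mlx_lm's standard prompt_cache hook, assuming IsoQuantCache takes no required constructor arguments and that one instance per transformer layer is expected (both assumptions about the turboquant API):

```python
from mlx_lm import load, generate
from turboquant import IsoQuantCache  # constructor signature assumed

model, tokenizer = load("majentik/Qwen3.5-27B-RotorQuant-MLX-4bit")

# Hypothetical wiring: one compressed cache per layer, mirroring the
# per-layer list that mlx_lm.models.cache.make_prompt_cache(model) returns.
cache = [IsoQuantCache() for _ in range(len(model.layers))]

response = generate(
    model,
    tokenizer,
    prompt="Summarize the main results of the report.",
    max_tokens=2048,
    prompt_cache=cache,  # mlx_lm's hook for custom KV caches
)
print(response)
```

Consult the turboquant package documentation for the actual API before relying on this pattern.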
Specifications
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Weight Quantization | MLX 4-bit affine (group size 64) |
| KV Cache Method | RotorQuant (Clifford algebra block-diagonal rotations) |
| Model Size | ~15 GB |
| Context Length | 262K (native), 1M+ (extended) |
| Platform | Apple Silicon (M1/M2/M3/M4/M5) |
What is RotorQuant?
RotorQuant uses Clifford algebra block-diagonal rotations for KV cache quantization, achieving superior efficiency compared to vector-quantization-based approaches like TurboQuant:
| Metric | RotorQuant | TurboQuant |
|---|---|---|
| Prefill speed | 5.3x faster | baseline |
| Decode speed | 28% faster | baseline |
| Quantizer parameters | 44x fewer | baseline |
| Perplexity | 6.91 | 7.07 |
RotorQuant compresses the KV cache to ~3-bit effective precision while maintaining lower perplexity than TurboQuant's 4-bit approach, thanks to the mathematical efficiency of geometric algebra rotations.
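As a toy illustration (not the shipped implementation), the simplest Clifford rotors are plane rotations acting on pairs of dimensions. Stacking them block-diagonally decorrelates each key/value vector before uniform low-bit quantization, and only one angle per dimension pair has to be stored, which hints at where the small quantizer-parameter count comes from. All names and the 3-bit quantizer below are illustrative:

```python
import numpy as np

def block_diagonal_rotate(x, angles):
    """Apply an independent 2x2 rotor to each consecutive dimension pair."""
    y = x.reshape(-1, 2)
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(y)
    out[:, 0] = c * y[:, 0] - s * y[:, 1]
    out[:, 1] = s * y[:, 0] + c * y[:, 1]
    return out.reshape(x.shape)

def quantize_uniform(x, bits=3):
    """Uniform affine quantization to the given bit width (dequantized)."""
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

head_dim = 128
key = np.random.randn(head_dim).astype(np.float32)
angles = np.random.randn(head_dim // 2)  # hypothetical learned rotor angles
compressed = quantize_uniform(block_diagonal_rotate(key, angles), bits=3)
```

Storing head_dim / 2 angles per head is far cheaper than a vector-quantization codebook, consistent with the parameter-count row in the table above.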
Thinking Mode
Qwen3.5-27B generates extended reasoning before responding by default. The combination of weight quantization and KV cache compression is especially valuable here: thinking tokens inflate the KV cache, which RotorQuant compresses, while the 4-bit weights leave more unified memory for that cache to grow into.
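To skip the reasoning preamble, Qwen3's chat template exposes an enable_thinking switch; the example below assumes Qwen3.5 keeps the same convention (worth verifying against the base model's card):

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/Qwen3.5-27B-RotorQuant-MLX-4bit")
messages = [{"role": "user", "content": "Explain the theory of relativity"}]

# enable_thinking=False follows Qwen3's chat-template convention;
# confirm it carries over to Qwen3.5 before relying on it.
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```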
Memory Estimate
| Configuration | Model Weights | KV Cache (128K ctx) | Total |
|---|---|---|---|
| FP16 (baseline) | ~54 GB | ~13 GB | ~67 GB |
| MLX 4-bit + RotorQuant | ~15 GB | ~1.3 GB | ~16.3 GB |
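As a sanity check on the FP16 row, KV cache size is 2 (keys and values) × kv_heads × head_dim × bytes per element × layers × tokens. The architecture numbers below are assumptions chosen to reproduce the table's figure, not confirmed Qwen3.5-27B values:

```python
# Assumed architecture; the real Qwen3.5-27B layer/head counts may differ.
layers, kv_heads, head_dim = 48, 4, 128
ctx, bytes_fp16 = 128 * 1024, 2

kv_bytes = 2 * kv_heads * head_dim * bytes_fp16 * layers * ctx  # K and V
print(f"{kv_bytes / 1e9:.1f} GB")  # ≈ 12.9 GB, matching the ~13 GB row
```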
See Also
Quant trade-off (MLX lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 2-bit | ~7.3 GB | Aggressive quantization | Very low-RAM Macs |
| 3-bit | ~10 GB | Lossy but small | Low-RAM Macs |
| **4-bit** | **~12 GB** | **Balanced default** | **Recommended for most Macs** |
| 5-bit | ~14 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~17 GB | Approaching FP16 quality | High-fidelity |
| 8-bit | ~21 GB | Near-lossless reference | Fidelity-critical work |
(Current variant — 4bit — is bolded.)
Variants in this family
(Showing 16 sibling variants under majentik/qwen3.5-27b-*. The current variant — RotorQuant-MLX-4bit — is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-2bit | transformers | n/a | Standalone 2-bit weights |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~23 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~16 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~21 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~30 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~36 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~57 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| **RotorQuant-MLX-4bit** | **mlx-lm** | **~17 GB** | **Apple Silicon balanced** |
| RotorQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-2bit | transformers | n/a | Standalone 2-bit weights |
| TurboQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |