Qwen3.6 35B-A3B - RotorQuant MLX 2-bit

2-bit weight-quantized MLX version of Qwen/Qwen3.6-35B-A3B with RotorQuant KV-cache quantization, optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant. This is the most aggressive quantization in the family, fitting the full model in the smallest possible footprint. Only 3B parameters are active per token despite 35B total, so the model is significantly more efficient at inference time than its parameter count suggests.

Approximate model size: ~9 GB
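
The efficiency claim above comes from MoE routing: each token is sent to only a small top-k subset of expert networks, so only a fraction of the weights participate in any forward pass. A toy sketch of that gating step (hypothetical expert counts and sizes, not this model's actual configuration):

import mlx.core as mx

num_experts, top_k, dim = 64, 4, 1024           # hypothetical sizes, not Qwen's real config
token = mx.random.normal((dim,))                # one token's hidden state
router = mx.random.normal((dim, num_experts))   # learned routing matrix

scores = token @ router                         # one logit per expert
active = mx.argsort(scores)[-top_k:]            # only the top-k experts actually run
print("active experts:", active)
print(f"~{top_k / num_experts:.0%} of expert parameters used per token")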

Model Specifications

Property               Value
Base Model             Qwen/Qwen3.6-35B-A3B
Parameters             35 billion total (3 billion active per token)
Architecture           Mixture-of-Experts (MoE)
Modality               Multimodal: image + video + text input, text output
License                Apache 2.0
Weight Quantization    2-bit (~9 GB)
KV-Cache Quantization  RotorQuant
Framework              MLX (Apple Silicon)

Quickstart

from mlx_lm import load, generate

# Download (on first use) and load the 2-bit weights and tokenizer.
model, tokenizer = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-2bit")

# Text-only generation; see below for image input via mlx_vlm.
prompt = "Explain mixture-of-experts models in two sentences."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
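
For instruction-following chat behavior, it is usually better to render the conversation through the tokenizer's chat template before generating. A minimal sketch, continuing from the snippet above (it assumes the bundled tokenizer ships a chat template, as Qwen tokenizers normally do):

messages = [{"role": "user", "content": "Summarize the benefits of KV-cache quantization."}]
# Render the conversation into the model's expected prompt format.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)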

For multimodal usage with images:

from mlx_vlm import load, generate

# mlx_vlm loads the vision tower and an image processor alongside the language model.
model, processor = load("majentik/Qwen3.6-35B-A3B-RotorQuant-MLX-2bit")

prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)

What is RotorQuant?

RotorQuant is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Combined with 2-bit weight quantization in MLX, it pairs the smallest possible model footprint with the fastest compressed KV cache, enabling efficient long-context generation.

Key advantages over TurboQuant:

  • 5.3x faster prefill
  • 28% faster decode
  • Equivalent memory savings

Note: 2-bit quantization is the most aggressive option and may result in some quality degradation compared to higher-precision variants. It is best suited for experimentation, rapid prototyping, or hardware-constrained environments.
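
RotorQuant's internals are not documented in this card; the sketch below illustrates only the general idea behind KV-cache quantization (per-group affine quantization of cached keys and values), not RotorQuant's actual algorithm:

import mlx.core as mx

def quantize_kv(x, bits=4, group_size=64):
    # Generic per-group affine quantization: store low-bit integer codes plus
    # one (scale, offset) pair per group instead of full-precision values.
    # Illustrative only -- this is NOT RotorQuant's actual algorithm.
    *lead, d = x.shape
    g = x.reshape(*lead, d // group_size, group_size)
    lo = mx.min(g, axis=-1, keepdims=True)
    hi = mx.max(g, axis=-1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    codes = mx.round((g - lo) / scale)        # what would actually be stored
    dequant = codes * scale + lo              # what attention reads back
    return dequant.reshape(*x.shape)

keys = mx.random.normal((1, 8, 1024, 128))    # (batch, heads, seq_len, head_dim)
approx = quantize_kv(keys, bits=4)
print(mx.max(mx.abs(keys - approx)))          # error is bounded by scale / 2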

KV-Cache Quantization Comparison

Method      Prefill Speed    Decode Speed     Memory Savings   Reference
TurboQuant  1x (baseline)    1x (baseline)    High             arXiv:2504.19874
RotorQuant  5.3x faster      28% faster       High             GitHub

Memory Estimates (Qwen3.6 35B-A3B)

Precision          Approximate Size   MLX Variant
FP16 (original)    ~70 GB             --
8-bit quantized    ~35 GB             RotorQuant-MLX-8bit
4-bit quantized    ~18 GB             RotorQuant-MLX-4bit
2-bit quantized    ~9 GB              This model
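
These figures follow from parameter count times bits per weight. A quick back-of-the-envelope check (it ignores quantization metadata such as per-group scales, plus any higher-precision embeddings, which add modest overhead):

def approx_size_gb(params_billion, bits):
    # params * bits / 8 gives bytes; divide by 1e9 for decimal gigabytes.
    return params_billion * 1e9 * bits / 8 / 1e9

print(approx_size_gb(35, 2))   # 8.75  -> matches the ~9 GB figure above
print(approx_size_gb(35, 16))  # 70.0  -> the FP16 baseline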

Hardware Requirements

This model requires approximately 9 GB of unified memory for the weights alone; leave additional headroom for the KV cache and activations. Recommended hardware:

  • Apple M1 (16 GB+)
  • Apple M2 (16 GB+)
  • Apple M3 (16 GB+)
  • Apple M4 (16 GB+)
  • Any Apple Silicon Mac with 16 GB+ unified memory
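
You can check available unified memory from Python before loading. A minimal sketch; it assumes a recent mlx release that exposes mx.metal.device_info() with a memory_size field:

import mlx.core as mx

info = mx.metal.device_info()                 # assumption: recent mlx version
total_gb = info["memory_size"] / 1e9          # total unified memory on this Mac
print(f"Unified memory: {total_gb:.0f} GB")
if total_gb < 16:
    print("Below the recommended 16 GB; expect swapping with this model.")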

Quant trade-off (MLX lane)

Bits     Approx size   Use case                   Recommendation
2-bit    ~9.1 GB       Aggressive quantization    Very low-RAM Macs (this model)
3-bit    ~13 GB        Lossy but small            Low-RAM Macs
4-bit    ~15 GB        Balanced default           Recommended for most Macs
5-bit    ~18 GB        Higher fidelity            Quality-sensitive
6-bit    ~21 GB        Approaching FP16 quality   High-fidelity
8-bit    ~27 GB        Near-lossless reference    Fidelity-critical work

(The current variant, 2-bit, is marked above.)

Variants in this family

(Showing 24 sibling variants under majentik/qwen3.6-35b-a3b-*. The current variant, RotorQuant-MLX-2bit, is marked.)

Variant                  Runtime            Approx size   Use case
RotorQuant               runtime modifier   n/a           KV-cache root (weight-agnostic)
RotorQuant-AWQ-4bit      transformers       ~22 GB        GPU 4-bit (AutoAWQ)
RotorQuant-AWQ-8bit      transformers       ~38 GB        GPU 8-bit (AutoAWQ)
RotorQuant-GGUF-IQ4_XS   llama.cpp          ~30 GB        Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-Q2_K     llama.cpp          ~21 GB        Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M   llama.cpp          ~27 GB        Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M   llama.cpp          ~38 GB        Balanced default
RotorQuant-GGUF-Q5_K_M   llama.cpp          ~46 GB        Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0     llama.cpp          ~74 GB        Near-lossless reference
RotorQuant-MLX-2bit      mlx-lm             ~11 GB        Apple Silicon, smallest (this model)
RotorQuant-MLX-3bit      mlx-lm             ~16 GB        Apple Silicon, small
RotorQuant-MLX-4bit      mlx-lm             ~22 GB        Apple Silicon, balanced
RotorQuant-MLX-5bit      mlx-lm             ~27 GB        Apple Silicon, higher fidelity
RotorQuant-MLX-6bit      mlx-lm             ~32 GB        Apple Silicon, near-lossless
RotorQuant-MLX-8bit      mlx-lm             ~41 GB        Apple Silicon reference
TurboQuant               runtime modifier   n/a           KV-cache root (weight-agnostic)
TurboQuant-AWQ-4bit      transformers       ~22 GB        GPU 4-bit (AutoAWQ)
TurboQuant-AWQ-8bit      transformers       ~38 GB        GPU 8-bit (AutoAWQ)
TurboQuant-MLX-2bit      mlx-lm             ~11 GB        Apple Silicon, smallest
TurboQuant-MLX-3bit      mlx-lm             ~16 GB        Apple Silicon, small
TurboQuant-MLX-4bit      mlx-lm             ~22 GB        Apple Silicon, balanced
TurboQuant-MLX-5bit      mlx-lm             ~27 GB        Apple Silicon, higher fidelity
TurboQuant-MLX-6bit      mlx-lm             ~32 GB        Apple Silicon, near-lossless
TurboQuant-MLX-8bit      mlx-lm             ~41 GB        Apple Silicon reference