Image-Text-to-Text
MLX
Safetensors
English
Chinese
multilingual
qwen3_5_moe
rotorquant
apple-silicon
weight-quantization
8-bit precision
qwen3.5
qwen
Mixture of Experts
sparse-moe
multimodal
quantized
long-context
conversational
Ways to use majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit with libraries, notebooks, and local apps:
- Libraries
- MLX
How to use majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit with MLX:
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit")
config = load_config("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit
Run Hermes
hermes
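Both the Pi and Hermes setups above point at the same local OpenAI-compatible endpoint, so it can help to check the server directly before configuring an agent. The following is a minimal sketch using only the Python standard library. It assumes the server is listening on `localhost:8080` (as in the configs above) and exposes the usual `/v1/chat/completions` route; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumed defaults: mlx_lm.server on localhost:8080, matching the configs above.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit"


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=256, timeout=600):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Say hello in one sentence."))` should return a short reply; a connection error usually means the server has not finished loading the weights or is bound to a different port.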
Qwen3.5-397B-A17B-RotorQuant-MLX-8bit
8-bit MLX weight-quantized build of Qwen/Qwen3.5-397B-A17B — a 397B total / 17B active Sparse MoE multimodal model — prepared with RotorQuant (learned orthogonal rotors, calibrated on ~512 samples before quantization). Optimized for Apple Silicon via MLX.
At 8-bit, RotorQuant is effectively indistinguishable from FP16 on standard benchmarks while roughly halving the on-disk size relative to FP16 (about 2× compression).
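To make the 8-bit claim concrete, here is a small pure-Python sketch of group-wise affine quantization, the general kind of scheme used for MLX weight quantization. It is an illustration only, not the code used to build this repo: the group size of 64 and the helper names are assumptions, and the rotation step of RotorQuant is left out.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo  # `lo` plays the role of the per-group bias


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float values."""
    return [c * scale + bias for c in codes]


def roundtrip(weights, group_size=64, bits=8):
    """Quantize and dequantize a flat weight list, one group at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, bias = quantize_group(group, bits)
        restored.extend(dequantize_group(codes, scale, bias))
    return restored
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which at 256 levels per group is small compared with the spread of the group's weights.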
Quickstart
from mlx_lm import load, generate
model, tokenizer = load("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    add_generation_prompt=True,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
Multimodal via mlx-vlm:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("majentik/Qwen3.5-397B-A17B-RotorQuant-MLX-8bit")
prompt = apply_chat_template(
    processor, config=model.config,
    prompt="Describe this diagram.", num_images=1,
)
out = generate(model, processor, prompt, image=["./diagram.png"], max_tokens=512)
print(out)
Model Specs
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B |
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Total parameters | 397B |
| Active per token | 17B |
| Modalities | Image + Text → Text (image-text-to-text) |
| Context window | 256K tokens |
| Weight quantization | 8-bit MLX (RotorQuant learned rotors) |
| Approx. disk footprint | ~397 GB |
| License | Apache 2.0 |
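The disk footprint in the table can be sanity-checked with back-of-envelope arithmetic: one byte per parameter at 8-bit versus two at FP16. The sketch below deliberately ignores quantization scales and biases, any layers kept at higher precision, and file metadata, so the real checkpoint size may differ somewhat.

```python
TOTAL_PARAMS = 397e9  # total parameter count from the table above


def weight_gb(params, bits_per_weight):
    """Raw weight storage in decimal GB, ignoring scales, biases, and metadata."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(TOTAL_PARAMS, 16)
int8_gb = weight_gb(TOTAL_PARAMS, 8)
print(f"FP16: ~{fp16_gb:.0f} GB, 8-bit: ~{int8_gb:.0f} GB, "
      f"ratio: {fp16_gb / int8_gb:.1f}x")
```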
RotorQuant vs TurboQuant
| Aspect | RotorQuant (this repo) | TurboQuant |
|---|---|---|
| Rotation | Learned orthogonal rotors (data-calibrated) | Randomized Hadamard (static) |
| Calibration | ~512 sample calibration pass | Zero-shot |
| Accuracy @ 8-bit | ~99.95% of FP16 baseline | ~99.9% of FP16 baseline |
| Best for | Maximum fidelity in long-reasoning regimes | Fastest turnaround, no calibration data |
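The idea shared by both columns is to rotate the weights with an orthogonal transform before quantizing, which spreads outliers across many coordinates, and to undo the rotation afterwards. The sketch below illustrates the static case from the table with a randomized Hadamard rotation in pure Python. It is not the RotorQuant or TurboQuant implementation: a learned-rotor variant would replace the fixed transform with a calibrated orthogonal matrix, and all function names here are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def rotate(vec, signs):
    """Randomized Hadamard rotation: flip signs, then transform."""
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(vec, signs):
    """Inverse rotation: the normalized transform is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def quantize_int8(vec):
    """Symmetric 8-bit quantization with a single absmax scale."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale


random.seed(0)
signs = [random.choice((-1.0, 1.0)) for _ in range(8)]
weights = [8.0, 0.1, -0.2, 0.05, 0.3, -0.1, 0.02, 0.15]  # one large outlier

codes, scale = quantize_int8(rotate(weights, signs))
restored = unrotate([c * scale for c in codes], signs)
```

Because the rotation is orthogonal, it preserves vector norms, so the quantization error measured in the rotated domain carries over unchanged (in the L2 sense) after `unrotate`.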
Memory Estimates (8-bit MLX)
| Context | Active memory (approx.) |
|---|---|
| 8K | ~405 GB |
| 32K | ~415 GB |
| 128K | ~445 GB |
| 256K | ~475 GB |
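For context lengths between the listed rows, a rough figure can be obtained by interpolating the table. The helper below is only a sketch: it assumes memory grows roughly linearly between adjacent rows, clamps outside the listed range, and uses hypothetical function names.

```python
# (context in K tokens, approximate active memory in GB), copied from the table above
MEMORY_TABLE = [(8, 405), (32, 415), (128, 445), (256, 475)]


def estimate_memory_gb(context_k):
    """Linearly interpolate between table rows; clamp outside the listed range."""
    if context_k <= MEMORY_TABLE[0][0]:
        return float(MEMORY_TABLE[0][1])
    for (c0, m0), (c1, m1) in zip(MEMORY_TABLE, MEMORY_TABLE[1:]):
        if context_k <= c1:
            return m0 + (m1 - m0) * (context_k - c0) / (c1 - c0)
    return float(MEMORY_TABLE[-1][1])


def fits(context_k, unified_memory_gb):
    """True if the estimated active memory is within the given budget."""
    return estimate_memory_gb(context_k) <= unified_memory_gb
```

For example, `estimate_memory_gb(64)` gives the midpoint-style estimate for a 64K context, and `fits(256, 512)` checks the full context window against a 512 GB machine. Neither accounts for memory the operating system reserves for itself.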
Hardware Requirements
- Minimum: Apple Silicon workstation with 512 GB unified memory
- Recommended: 512 GB+ for long-context workloads
- Does not fit on 96 GB / 128 GB / 192 GB / 256 GB Macs — use 4-bit or 2-bit variants instead
See Also
- RotorQuant MLX variants: 6-bit · 5-bit · 4-bit · 2-bit
- TurboQuant MLX 8-bit: majentik/Qwen3.5-397B-A17B-TurboQuant-MLX-8bit
- KV-cache wrapper: majentik/Qwen3.5-397B-A17B-RotorQuant
- Base model: Qwen/Qwen3.5-397B-A17B