Qwen3.5-35B-A3B — RAMP v2 (15.2 GB)

Hardware-optimized GGUF quantization of Qwen3.5-35B-A3B for RTX 5060 Ti 16GB.

Produced with RAMP (RL-guided Adaptive Mixed-Precision), a data-free quantization pipeline that uses per-tensor sensitivity analysis and evolutionary search to find the optimal mixed-precision configuration for your specific hardware.

Key specs

Metric Value
Base model Qwen/Qwen3.5-35B-A3B
File size 15.2 GB
Average BPW 3.78
Base quant type IQ3_S
Critical path overrides Q8_0 (SSM gates, norms), Q6_K/Q5_K (attention QKV, shared expert)
Generation speed 80 tok/s on the chimere-server HTTP path (90 tok/s bare ik_llama-server), RTX 5060 Ti, sm_120, -ngl 99 --n-cpu-moe 4
Context 32K tokens (q8_0 keys + q4_0 values KV cache)
Functional benchmark 30/30
VRAM usage ~14 GB (GPU) + CPU experts

What makes RAMP different

Standard quantization applies the same precision to all tensors. RAMP assigns per-tensor precision based on sensitivity analysis:

  • SSM gates and norms → Q8_0 (critical for GDN recurrent state stability)
  • Attention Q/K/V projections → Q5_K/Q6_K (quality-sensitive)
  • MoE shared expert → Q5_K (always active, high impact)
  • MoE routed experts → IQ3_S (256 experts, only 8 active per token)

This is built with a custom imatrix calibrated on French + English + code + clinical (kiné) data, not generic wiki text.

How to use

# With ik_llama.cpp (recommended for sm_120 GPUs)
./llama-server \
  -m Qwen3.5-35B-A3B-RAMP-v2-15g.gguf \
  -ngl 99 --n-cpu-moe 4 \
  -np 1 -c 32768 \
  --cache-type-k q8_0 --cache-type-v q4_0

# With stock llama.cpp
./llama-server \
  -m Qwen3.5-35B-A3B-RAMP-v2-15g.gguf \
  -ngl 99 \
  -c 32768

Quantization pipeline

  1. Start from Qwen3.5-35B-A3B BF16 (Unsloth GGUF)
  2. Custom imatrix: domain-calibrated (BFCL + MoT + Codeforces + French clinical)
  3. RAMP sensitivity analysis: per-tensor NSDS scoring (data-free)
  4. llama-quantize --imatrix chimere --custom-q with 317 tensor overrides
  5. Validation: 30/30 functional bench, perplexity check

Previous versions:

  • RAMP v1 (17 GB, Q3_K_M base, no imatrix) — backup
  • IQ3_S custom-mix (14.71 GB, 3.56 BPW) — backup

Hardware tested

  • GPU: NVIDIA RTX 5060 Ti 16GB (Blackwell, sm_120)
  • CPU: Intel i5-14600KF
  • RAM: 32GB DDR5
  • Driver: 590.48 (CUDA 12.8 toolkit)

Related

Author

Kevin Remondiere — Independent ML researcher, Bayonne, France

License

Apache 2.0 (quantization pipeline and model card). The base model follows Qwen's license.

Downloads last month
15
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Kevletesteur/Qwen3.5-35B-A3B-RAMP-v2-15G

Quantized
(258)
this model