Qwen3.5-35B-A3B — RAMP v2 (15.2 GB)
Hardware-optimized GGUF quantization of Qwen3.5-35B-A3B for RTX 5060 Ti 16GB.
Produced with RAMP (RL-guided Adaptive Mixed-Precision), a data-free quantization pipeline that uses per-tensor sensitivity analysis and evolutionary search to find the optimal mixed-precision configuration for your specific hardware.
Key specs
| Metric | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B |
| File size | 15.2 GB |
| Average BPW | 3.78 |
| Base quant type | IQ3_S |
| Critical path overrides | Q8_0 (SSM gates, norms), Q6_K/Q5_K (attention QKV, shared expert) |
| Generation speed test setup | ik_llama-server, RTX 5060 Ti, sm_120, -ngl 99 --n-cpu-moe 4 |
| Context | 32K tokens (q8_0 keys + q4_0 values KV cache) |
| Functional benchmark | 30/30 |
| VRAM usage | ~14 GB (GPU) + CPU experts |
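As a rough sanity check on the file size and average BPW rows, the sketch below computes how many weights those two numbers imply. It assumes the whole file is weight data (GGUF metadata is ignored) and shows both readings of "GB", since the card does not say whether the unit is decimal or binary. The function name `implied_params` is illustrative only.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte


def implied_params(file_size_bytes: float, bits_per_weight: float) -> float:
    """Number of weights implied by a file size and an average bits-per-weight."""
    return file_size_bytes * 8 / bits_per_weight


# Try both unit conventions for the 15.2 "GB" figure in the table.
for label, unit in (("GB", GB), ("GiB", GIB)):
    n = implied_params(15.2 * unit, 3.78)
    print(f"15.2 {label} at 3.78 BPW -> {n / 1e9:.1f}B weights")
```

The binary reading lands closer to the 35B nominal size of the base model, which suggests the file size is quoted in GiB, though that is an inference rather than something the card states.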
What makes RAMP different
Standard quantization presets apply one base precision to nearly all tensors. RAMP instead assigns a precision to each tensor based on sensitivity analysis:
- SSM gates and norms → Q8_0 (critical for GDN recurrent state stability)
- Attention Q/K/V projections → Q5_K/Q6_K (quality-sensitive)
- MoE shared expert → Q5_K (always active, high impact)
- MoE routed experts → IQ3_S (256 experts, only 8 active per token)
The quantization uses a custom imatrix calibrated on French, English, code, and clinical (kiné, i.e. physiotherapy) data rather than generic wiki text.
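The per-tensor assignment listed above can be pictured as an ordered rule table, where the first matching pattern decides the quant type. This is a minimal sketch, not the RAMP implementation: the tensor-name patterns are assumptions modeled on common GGUF naming, the real build uses 317 overrides chosen by sensitivity scoring, and attention projections are pinned to Q6_K here even though the card lists both Q5_K and Q6_K.

```python
import re

# Ordered (pattern, quant type) rules; the first match wins.
# Patterns are illustrative assumptions, not the actual override list.
RULES = [
    (r"ssm_|norm", "Q8_0"),        # SSM gates and norms
    (r"attn_(q|k|v)\.", "Q6_K"),   # attention Q/K/V projections
    (r"shexp", "Q5_K"),            # MoE shared expert
    (r"exps", "IQ3_S"),            # MoE routed experts
]
DEFAULT_TYPE = "IQ3_S"             # base quant type for everything else


def pick_quant(tensor_name: str) -> str:
    """Return the quant type of the first rule whose pattern matches the name."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return DEFAULT_TYPE


# Hypothetical tensor names, used only to exercise the rules.
for name in (
    "blk.0.ssm_alpha.weight",
    "blk.0.attn_q.weight",
    "blk.3.ffn_up_shexp.weight",
    "blk.3.ffn_up_exps.weight",
):
    print(f"{name:28s} -> {pick_quant(name)}")
```

Rule order matters: norms are checked before attention projections so that a name such as `attn_norm` keeps Q8_0.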
How to use
```bash
# With ik_llama.cpp (recommended for sm_120 GPUs)
./llama-server \
  -m Qwen3.5-35B-A3B-RAMP-v2-15g.gguf \
  -ngl 99 --n-cpu-moe 4 \
  -np 1 -c 32768 \
  --cache-type-k q8_0 --cache-type-v q4_0

# With stock llama.cpp
./llama-server \
  -m Qwen3.5-35B-A3B-RAMP-v2-15g.gguf \
  -ngl 99 \
  -c 32768
```
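Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on the default `http://localhost:8080`, since neither command above sets a port. The helper names `build_chat_request` and `ask`, and the `max_tokens` value, are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # arbitrary cap for a short answer
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running llama-server):
#   print(ask("What is the capital of France?"))
```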
Quantization pipeline
- Start from Qwen3.5-35B-A3B BF16 (Unsloth GGUF)
- Custom imatrix: domain-calibrated (BFCL + MoT + Codeforces + French clinical)
- RAMP sensitivity analysis: per-tensor NSDS scoring (data-free)
- `llama-quantize --imatrix chimere --custom-q` with 317 tensor overrides
- Validation: 30/30 functional bench, perplexity check
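To illustrate the `llama-quantize` step, the sketch below assembles a command line from a short list of overrides. It assumes `--custom-q` takes comma-separated `regex=type` pairs, which should be checked against the `llama-quantize --help` output of the build in use. The three overrides and all file names are hypothetical stand-ins for the 317 real ones.

```python
import shlex


def custom_q_arg(overrides):
    """Join (regex, type) pairs into one 'regex=type,regex=type' string."""
    return ",".join(f"{pattern}={qtype}" for pattern, qtype in overrides)


# Hypothetical overrides; the patterns must not contain commas.
overrides = [
    (r"ssm_.*\.weight", "q8_0"),
    (r"attn_[qkv]\.weight", "q6_K"),
    (r"ffn_.*_shexp\.weight", "q5_K"),
]

cmd = [
    "llama-quantize",
    "--imatrix", "imatrix.dat",        # hypothetical imatrix file
    "--custom-q", custom_q_arg(overrides),
    "model-bf16.gguf",                 # hypothetical input file
    "model-ramp.gguf",                 # hypothetical output file
    "IQ3_S",                           # base quant type from the card
]

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Building the argument programmatically keeps a long override list reviewable as data rather than as one hand-edited shell string.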
Previous versions:
- RAMP v1 (17 GB, Q3_K_M base, no imatrix) — backup
- IQ3_S custom-mix (14.71 GB, 3.56 BPW) — backup
Hardware tested
- GPU: NVIDIA RTX 5060 Ti 16GB (Blackwell, sm_120)
- CPU: Intel i5-14600KF
- RAM: 32GB DDR5
- Driver: 590.48 (CUDA 12.8 toolkit)
Related
- chimere — Rust inference runtime
- ramp-quant — Quantization pipeline source code
- chimere-odo — Inference orchestrator
Author
Kevin Remondiere — Independent ML researcher, Bayonne, France
License
Apache 2.0 (quantization pipeline and model card). The base model follows Qwen's license.