Qwen3-g023-tiny-v2 — GGUF
An advanced 30-layer Qwen3 variant using swap, interpolation, and skip-bridge surgery.
Created through layer surgery that combines multi-swap, interpolation, and bridge (skip connection) operations. It scores 94.3/100 on my own benchmark suite, a 6.5-point improvement over the original Qwen3-1.7B baseline (87.8/100) and the highest score reached in two phases of experimentation across ~250 configurations. (These are my own benchmarks, so results may vary if you run your own tests.)
Available Quantizations
| Quantization | Bits/Weight | Description | Download |
|---|---|---|---|
| Q8_0 | 8.50 | Highest quality, virtually lossless (recommended) | Qwen3-g023-tiny-v2-Q8_0.gguf |
| Q6_K | 6.57 | Excellent quality, good compression | Qwen3-g023-tiny-v2-Q6_K.gguf |
| Q4_K_M | 4.85 | Good balance of quality and size | Qwen3-g023-tiny-v2-Q4_K_M.gguf |
| Q3_K_M | 3.91 | High compression, moderate quality loss | Qwen3-g023-tiny-v2-Q3_K_M.gguf |
| Q2_K | 3.35 | Maximum compression, significant quality loss | Qwen3-g023-tiny-v2-Q2_K.gguf |
Model Details
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Layers | 30 (28 original + 2 from surgery) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | ~1.82B |
| Precision (source) | bfloat16 |
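The total parameter count can be sanity-checked from the other rows of the table. The sketch below is a rough estimate, assuming the standard Qwen3 block layout (query/key/value/output projections plus a three-matrix gated MLP) and ignoring the small normalization weights; the function names are illustrative and not part of any library.

```python
# Rough parameter count derived from the Model Details table.
# Assumption: standard Qwen3 block layout; norm weights are ignored.
HIDDEN = 2048
INTERMEDIATE = 6144
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
VOCAB = 151_936


def params_per_layer() -> int:
    """Weights in one transformer layer (attention + MLP)."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    return q_proj + k_proj + v_proj + o_proj + mlp


def total_params(num_layers: int) -> int:
    """Embeddings are counted once because they are tied."""
    embeddings = VOCAB * HIDDEN
    return embeddings + num_layers * params_per_layer()


for layers in (27, 28, 30):
    print(f"{layers} layers: ~{total_params(layers) / 1e9:.2f}B parameters")
```

Under these assumptions each layer contributes about 50.3M weights, so the two layers added by surgery account for the difference between the 28-layer base (~1.72B) and this model (~1.82B).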
Surgery Operations
This model was created by applying three innovative surgical operations to Qwen/Qwen3-1.7B:
- Multi-swap: layers 12↔13 and 16↔17 — Reorders attention blocks at two critical points in the network for improved representational flow through the mid-layers.
- Interpolation: layers 20 & 22 (α=0.5) — Creates a new layer by blending the weights of layers 20 and 22 at equal proportions, producing a smoother transition in the upper layers.
- Bridge (skip connection): layer 5 → after layer 20 — Copies early-layer representations (layer 5) and inserts them after layer 20, creating a skip connection that helps preserve low-level features deep in the network.
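The three operations above can be sketched on a toy layer list. This is an illustration only, not the actual surgery script: `ToyLayer` and the helper functions are hypothetical, layer indices are assumed to be 0-based, and the insertion point of the interpolated layer (directly after original layer 21) is an assumption, since the description does not state where it goes.

```python
from dataclasses import dataclass


@dataclass
class ToyLayer:
    label: str
    weights: dict[str, list[float]]  # stand-in for real weight tensors


def swap(layers, i, j):
    """Return a copy of the layer list with positions i and j exchanged."""
    out = list(layers)
    out[i], out[j] = out[j], out[i]
    return out


def interpolate(a, b, alpha=0.5):
    """Blend two layers element-wise: alpha * a + (1 - alpha) * b."""
    blended = {
        name: [alpha * x + (1 - alpha) * y
               for x, y in zip(a.weights[name], b.weights[name])]
        for name in a.weights
    }
    return ToyLayer(f"interp({a.label},{b.label})", blended)


def insert_after(layers, index, new_layer):
    """Return a copy of the layer list with new_layer placed after index."""
    return layers[: index + 1] + [new_layer] + layers[index + 1:]


# 28 toy layers; layer i carries the single weight value float(i).
base = [ToyLayer(str(i), {"w": [float(i)]}) for i in range(28)]

model = swap(base, 12, 13)
model = swap(model, 16, 17)
blend = interpolate(base[20], base[22], alpha=0.5)
bridge = ToyLayer("copy(5)", {k: list(v) for k, v in base[5].weights.items()})
# Insert at the higher index first so the lower index is not shifted.
model = insert_after(model, 21, blend)   # assumed position
model = insert_after(model, 20, bridge)  # "after layer 20", as described

print(len(model))
print([layer.label for layer in model[19:25]])
```

The toy run ends with 30 layers, matching the layer count in the Model Details table: 28 original layers plus one interpolated layer and one bridged copy.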
Why These Operations Work
- Multi-swap corrects suboptimal layer ordering that emerged from pre-training, allowing better gradient flow through the network's critical middle section.
- Interpolation creates a synthetic transition layer that smooths the representation gap between layers 20 and 22, reducing the information bottleneck.
- Bridge/skip connections address the "forgetting problem" in deep networks by reintroducing early feature representations at later stages — a technique inspired by ResNet's residual connections but applied at the transformer layer level.
Benchmark Results
| Metric | Original (28L) | v1 (27L) | v2 (30L) | Δ vs Original |
|---|---|---|---|---|
| Overall Score | 87.8 / 100 | 92.9 / 100 | 94.3 / 100 | +6.5 |
| Factual Accuracy | 15/17 (88%) | 17/17 (100%) | 16/17 (94%) | +6% |
| Avg Perplexity | — | 15.70 | 15.17 | — |
| Thinking Mode | ✅ | ✅ | ✅ | — |
| Non-Thinking Mode | ✅ | ✅ | ✅ | — |
Evaluated using a comprehensive test suite with 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
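Two of the reported metrics follow standard definitions and can be sketched directly. The snippet below assumes perplexity is the exponential of the mean negative log-likelihood per token and that factual accuracy is rounded to the nearest percent; the test suite's actual implementation and the weighting behind the overall score are not described here, so neither is reproduced.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def accuracy_percent(correct, total):
    """Share of correct answers, rounded to the nearest whole percent."""
    return round(100 * correct / total)


# Accuracy figures for 15, 16, and 17 correct answers out of 17.
print(accuracy_percent(15, 17), accuracy_percent(16, 17), accuracy_percent(17, 17))  # 88 94 100

# A model that assigns probability 0.25 to every token has perplexity of about 4.
print(round(perplexity([math.log(0.25)] * 8), 2))
```

Lower perplexity means the model assigns higher probability to the reference text, which is why 15.17 (v2) counts as an improvement over 15.70 (v1).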
Features
- Thinking mode: Full `<think>`/`</think>` reasoning support; toggle via the `enable_thinking` parameter
- Non-thinking mode: Direct responses without chain-of-thought overhead
- Tool calling: Full function/tool calling support
- System prompts: Standard system message support
- Chat template: Qwen3 ChatML template embedded in the GGUF
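When thinking mode is on, the reasoning arrives inside `<think>` tags ahead of the final answer. A minimal sketch for separating the two is shown below; `split_thinking` is a hypothetical helper, and it assumes at most one reasoning block per response. It also handles the case where the opening tag was part of the prompt, so only the closing tag appears in the output.

```python
def split_thinking(text):
    """Split a response into (reasoning, answer).

    Returns an empty reasoning string when no </think> tag is present,
    as in non-thinking mode.
    """
    if "</think>" not in text:
        return "", text.strip()
    reasoning, answer = text.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()


reasoning, answer = split_thinking("<think>\n2+2 is 4\n</think>\n\nThe answer is 4.")
print(repr(reasoning))  # '2+2 is 4'
print(repr(answer))     # 'The answer is 4.'
```

Keeping the reasoning separate makes it easy to log it for debugging while showing users only the answer.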
Usage
With Ollama
# Download the GGUF and create from Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v2-Q8_0.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0
TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF
ollama create qwen3-tiny-v2 -f Modelfile
ollama run qwen3-tiny-v2
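Once the model has been created, it can also be queried from Python through Ollama's local REST API. The sketch below assumes Ollama is running on its default port (11434) and that the model was created under the name `qwen3-tiny-v2` as above; `build_chat_request` and `chat` are illustrative helpers, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(prompt, model="qwen3-tiny-v2", system=None):
    """Build the JSON body for a single non-streaming chat turn."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages, "stream": False}


def chat(prompt, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the assistant's reply as a string. Sampling settings from the Modelfile apply automatically because they are stored with the created model.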
With llama.cpp
# Interactive chat
llama-cli -m Qwen3-g023-tiny-v2-Q8_0.gguf \
--chat-template chatml -cnv
# Thinking mode
llama-cli -m Qwen3-g023-tiny-v2-Q8_0.gguf \
-p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
-n 512
# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v2-Q8_0.gguf \
-p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
-n 128
With Python (llama-cpp-python)
from llama_cpp import Llama

# Load the local GGUF file with a 4,096-token context window
model = Llama("Qwen3-g023-tiny-v2-Q8_0.gguf", n_ctx=4096)

# The chat template embedded in the GGUF is applied automatically
response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
System Requirements
| Quantization | RAM (CPU) | VRAM (GPU) |
|---|---|---|
| Q8_0 | ~2.2 GB | ~2.2 GB |
| Q6_K | ~1.8 GB | ~1.8 GB |
| Q4_K_M | ~1.4 GB | ~1.4 GB |
| Q3_K_M | ~1.2 GB | ~1.2 GB |
| Q2_K | ~1.0 GB | ~1.0 GB |
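The table does not state which context length it assumes, and the KV cache grows linearly with context. The following back-of-the-envelope sketch uses values from the Model Details and quantization tables; it assumes a 16-bit KV cache and ignores runtime overhead and tensors kept at higher precision, so real usage will be somewhat higher. The function names are illustrative.

```python
# Values taken from the Model Details table.
LAYERS, KV_HEADS, HEAD_DIM = 30, 8, 128
TOTAL_PARAMS = 1.82e9


def weight_bytes(bits_per_weight, params=TOTAL_PARAMS):
    """Approximate size of the quantized weights."""
    return params * bits_per_weight / 8


def kv_cache_bytes(n_ctx, bytes_per_value=2):
    """Keys and values for every layer, assuming a 16-bit cache by default."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * n_ctx * bytes_per_value


print(f"Q4_K_M weights:       ~{weight_bytes(4.85) / 1e9:.2f} GB")
print(f"KV cache @ 4,096:     ~{kv_cache_bytes(4_096) / 1e9:.2f} GB")
print(f"KV cache @ 40,960:    ~{kv_cache_bytes(40_960) / 1e9:.2f} GB")
```

By this estimate the cache needs about 0.5 GB at a 4,096-token context and about 5 GB at the full 40,960 tokens, so long-context use requires noticeably more memory than the weights alone.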
v1 vs v2
This model (v2) is the Phase 2 champion, using advanced multi-operation surgery for the highest overall score.
| | v1 | v2 (this model) |
|---|---|---|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |
v1 is recommended when factual accuracy is paramount (100% vs 94%). v2 is recommended when overall quality matters more (94.3 vs 92.9).
Methodology
Layer surgery was performed through a systematic, test-driven process across two phases:
- Phase 1 (~150 configs): Exhaustive search across deletion, duplication, swapping, interpolation, and combined operations → champion: del_10 + swap_11↔12 (v1)
- Phase 2 (~95 configs): Advanced techniques including tripling, multi-swap, layer reversal, cycling, weight scaling, layer merging, skip bridges, and synthesis → champion: this model (v2)
- Evaluation: Each configuration scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
Phase 2 Leaderboard (Top 5)
| Rank | Configuration | Score | Factual | PPL |
|---|---|---|---|---|
| 🥇 | swap(12↔13,16↔17) + interp(20↔22) + bridge(5→20) | 94.3 | 94% | 15.17 |
| 🥈 | swap(12↔13,16↔17) + interp(20↔22) | 93.9 | 94% | 14.74 |
| 🥉 | swap(12↔13) + interp(20↔22) + bridge(5→20) | 93.4 | 94% | 15.66 |
| 4 | multi-swap(12↔13,16↔17) | 93.1 | 100% | 14.90 |
| 5 | Phase 1 champion (del_10 + swap_11↔12) | 92.9 | 100% | 15.70 |
Credits
- Base model: Qwen/Qwen3-1.7B by the Qwen team at Alibaba
- Quantization: llama.cpp
- Surgery: g023
License
Apache 2.0 — same as the original Qwen3-1.7B model.