SmolLM2-1.7B — Scheduled QAT (Linear Schedule) — GGUF

GGUF quantized versions of SmolLM2-1.7B, trained with Scheduled Quantization-Aware Training (linear bit-width reduction schedule) before quantization.

Key insight: Unlike weights produced by naive Post-Training Quantization (PTQ), these weights were trained specifically to tolerate quantization. During training, the simulated precision was gradually reduced from FP32 → FP16 → INT8 → INT4 on a linear schedule, so the model could adapt its weights to the quantization noise at each stage.

Files

| Filename | Quant | Size | BPW (bits per weight) | Description |
|---|---|---|---|---|
| smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf | Q4_K_M | ~1.0 GB | 4.93 | Recommended: best quality/size ratio for edge deployment |
| smollm2-1.7b-sched-qat-linear-Q8_0.gguf | Q8_0 | ~1.7 GB | 8.50 | Higher quality, larger size |
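
The BPW and size columns are linked by simple arithmetic: file size in bytes is roughly parameters × BPW / 8. The sketch below illustrates this, assuming a nominal 1.7 billion parameters (the exact count is not stated in this card) and ignoring GGUF metadata overhead. The helper name `estimated_size_gib` is made up for this example.

```python
# Rough GGUF size estimate from bits per weight (BPW).
# ASSUMPTION: 1.7e9 parameters (nominal figure; exact count not given in this card).
N_PARAMS = 1.7e9


def estimated_size_gib(bpw: float, n_params: float = N_PARAMS) -> float:
    """Return the approximate file size in GiB for a given bits-per-weight."""
    size_bytes = n_params * bpw / 8  # 8 bits per byte
    return size_bytes / 2**30


for name, bpw in [("Q4_K_M", 4.93), ("Q8_0", 8.50)]:
    print(f"{name}: ~{estimated_size_gib(bpw):.2f} GiB")
```

Under these assumptions both estimates land close to the sizes listed in the table.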

Training Details

| Parameter | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B |
| Method | Scheduled QAT (Linear bit-width reduction) |
| Training data | WikiText-103 (4000 sequences × 512 tokens) |
| Hardware | Kaggle TPU v5e-8 (8 cores) |
| Epochs | 1 |
| Effective batch size | 64 (4 per-core × 2 grad accum × 8 cores) |
| Learning rate | 2e-5 (cosine decay) |
| Optimizer | AdamW (weight_decay=0.01) |
| Training time | ~1150 seconds |
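
The batch-size and data figures above imply a fairly small training run. The arithmetic can be checked as follows. The step count assumes that a final partial batch still counts as one optimizer step, which this card does not specify.

```python
import math

# Values taken from the Training Details table.
per_core_batch = 4
grad_accum_steps = 2
num_cores = 8
num_sequences = 4000
seq_len = 512

effective_batch = per_core_batch * grad_accum_steps * num_cores
tokens_per_step = effective_batch * seq_len
total_tokens = num_sequences * seq_len
# ASSUMPTION: the last, partially filled batch is still one optimizer step.
optimizer_steps = math.ceil(num_sequences / effective_batch)

print(effective_batch)   # 64
print(tokens_per_step)   # 32768
print(total_tokens)      # 2048000
print(optimizer_steps)   # 63
```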

Bit-Width Schedule

Epoch:  0.0 ──── 0.1 ──────────── 0.9 ──── 1.0
Bits:   FP32      │    Linear     │   INT4
        (warmup)  │   32→16→8→4   │  (stabilize)

| Phase | Epoch Range | Bit-width |
|---|---|---|
| Warmup | 0.0 → 0.1 | FP32 (no quantization noise) |
| Linear reduction | 0.1 → 0.9 | 32 → 16 → 8 → 4 (gradual) |
| Stabilization | 0.9 → 1.0 | INT4 (final fine-tuning) |
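
One way to read the three phases is as a piecewise function of training progress. The sketch below is an illustration, not the training code. It assumes the bit-width is interpolated linearly from 32 to 4 between epochs 0.1 and 0.9. Whether the actual run used fractional bit-widths or stepped through the 16 and 8 stages at fixed points is not stated here. Both function names are hypothetical, and `snap_to_stage` shows only one possible way to map a fractional value onto the named stages.

```python
def scheduled_bits(progress: float,
                   warmup_end: float = 0.1,
                   reduce_end: float = 0.9,
                   start_bits: float = 32.0,
                   final_bits: float = 4.0) -> float:
    """Map training progress in [0, 1] to a (possibly fractional) bit-width."""
    if progress <= warmup_end:
        return start_bits            # warmup: full precision
    if progress >= reduce_end:
        return final_bits            # stabilization: hold at 4 bits
    # ASSUMPTION: straight-line interpolation between the two phase boundaries.
    frac = (progress - warmup_end) / (reduce_end - warmup_end)
    return start_bits + frac * (final_bits - start_bits)


def snap_to_stage(bits: float) -> int:
    """HYPOTHETICAL helper: round a bit-width up to the nearest listed stage."""
    for stage in (4, 8, 16, 32):
        if bits <= stage:
            return stage
    return 32


for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    b = scheduled_bits(p)
    print(f"progress={p:.1f}  bits={b:5.1f}  stage={snap_to_stage(b)}")
```

In a training loop, `progress` would be the current optimizer step divided by the total number of steps.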

QAT Training Results (WikiText-103 Test)

| Metric | Value |
|---|---|
| Test loss | 3.0392 |
| Test perplexity | 20.89 |
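
The two metrics are related: perplexity is the exponential of the mean per-token cross-entropy loss. A quick check, assuming the loss is measured in nats (natural logarithm), which is the usual convention:

```python
import math

test_loss = 3.0392  # mean per-token cross-entropy, assumed to be in nats
perplexity = math.exp(test_loss)
print(f"{perplexity:.2f}")  # 20.89
```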

How It Works

  1. Training (QAT): Model weights are trained with fake quantization nodes that simulate rounding noise at the currently scheduled bit-width (ending at INT4) in every forward pass after warmup. Gradient updates then move the weights toward values that lose little when rounded to the quantization grid.
  2. Export (this repo): The QAT-hardened bf16 weights are converted to GGUF format and quantized to actual 4-bit (Q4_K_M) and 8-bit (Q8_0) formats using llama.cpp's llama-quantize.
  3. Deployment: The GGUF files run directly on edge devices via llama.cpp, for example on Android, iOS, or Raspberry Pi.
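
To make the "fake quantization" in step 1 concrete, here is a minimal forward-pass sketch in plain Python. It is an assumption-laden illustration, not the code used for this model. It uses symmetric per-tensor quantization with a max-abs scale, and the name `fake_quantize` is hypothetical. In real QAT the backward pass typically treats the rounding as the identity (a straight-through estimator), which this sketch does not cover.

```python
def fake_quantize(weights: list[float], bits: int) -> list[float]:
    """Quantize to a signed integer grid, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)                # nothing to quantize
    scale = max_abs / qmax                  # ASSUMPTION: per-tensor max-abs scale
    out = []
    for w in weights:
        q = round(w / scale)                # snap to the nearest grid point
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        out.append(q * scale)               # dequantize: value the model sees
    return out


weights = [0.82, -0.31, 0.05, -0.77]        # made-up example values
for bits in (8, 4):
    dq = fake_quantize(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, dq))
    print(bits, [round(v, 3) for v in dq], f"max error {err:.4f}")
```

The rounding error is bounded by half the grid spacing, so it grows as the bit-width shrinks. That growing error is the noise the schedule introduces gradually.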

Because QAT pre-adapted the weights for quantization, these GGUF files should retain more quality than naively quantized (PTQ) versions of the same model.

Usage

With llama.cpp CLI

# Download
wget https://huggingface.co/jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-GGUF/resolve/main/smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf

# Run
./llama-cli -m smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf \
    -p "The future of artificial intelligence is" -n 100

With llama-cpp-python

from llama_cpp import Llama

llm = Llama(model_path="smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf")
output = llm("The future of AI is", max_tokens=100)
print(output["choices"][0]["text"])

Citation

This model is part of a thesis on Scheduled Quantization-Aware Training for Small Language Models targeting edge deployment.

License

Apache 2.0 (same as base model)
