---
library_name: transformers
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
  - HuggingFaceTB/SmolLM3-3B-Base
tags:
- openvino
- int4
- quantization
- edge-deployment
- optimization
- smollm3
inference: false
---

# SmolLM3 INT4 OpenVINO

## 🚀 Optimized for Edge Deployment

This is an INT4-quantized version of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), produced with OpenVINO and designed for efficient inference on edge devices and CPUs.

## Model Overview

- **Base Model:** SmolLM3-3B
- **Quantization:** INT4 via OpenVINO
- **Size Reduction:** INT4 weights occupy roughly a quarter of the storage of the FP16 originals (measured on-disk figures to follow)
- **Target Hardware:** CPUs, Intel GPUs, NPUs
- **Use Cases:** Local inference, edge deployment, resource-constrained environments

## 🔧 Technical Details

### Quantization Process
```python
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
```
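
For reference, the sketch below shows how an INT4 weight-compressed export of this kind is typically produced with Optimum Intel. The settings (`sym`, `group_size`, `ratio`) are illustrative defaults, not necessarily the ones used for this checkpoint:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Illustrative settings -- the exact configuration behind this checkpoint
# is not recorded here.
quantization_config = OVWeightQuantizationConfig(
    bits=4,          # INT4 weight compression
    sym=True,        # symmetric quantization
    group_size=128,  # shared scale per group of 128 weights
    ratio=0.8,       # fraction of layers compressed to INT4 (the rest stay INT8)
)

model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,  # convert the Transformers checkpoint to OpenVINO IR
    quantization_config=quantization_config,
)
model.save_pretrained("smollm3-int4-ov")
```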

### Model Architecture
- Same architecture as SmolLM3-3B
- GQA and NoPE preserved
- 64k context support (128k with YaRN)
- Multilingual capabilities maintained

## 📊 Performance (Experimental)

> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks pending.

Expected benefits of INT4 quantization:
- Reduced model size
- Faster CPU inference
- Lower memory requirements
- Some quality trade-off

Actual metrics will be added after proper benchmarking.

## 🛠️ How to Use

### Installation
```bash
pip install "optimum[openvino]" transformers
```

### Basic Usage
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
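
Until formal numbers are posted, a rough (and unscientific) timing loop gives a feel for local throughput. It reuses `model` and `tokenizer` from the snippet above:

```python
import time

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```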

### With Extended Thinking
```python
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]

# Build the prompt, then generate with the model and tokenizer from above
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
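
SmolLM3 also accepts `/no_think` in the system prompt to skip the reasoning trace when you only want the final answer.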

## 🎯 Intended Use

- **Edge AI applications**
- **Local LLM deployment**
- **Resource-constrained environments**
- **Privacy-focused applications**
- **Offline AI assistants**

## ⚡ Optimization Tips

1. **CPU Inference:** Use the OpenVINO runtime for best performance (see the sketch after this list)
2. **Batch Processing:** Consider batching requests when possible
3. **Memory:** INT4 significantly reduces memory requirements
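
As a concrete example of tips 1 and 2, OpenVINO runtime options can be passed via `ov_config` when loading the model. `PERFORMANCE_HINT` and `CACHE_DIR` are standard OpenVINO properties; the values here are illustrative:

```python
from optimum.intel import OVModelForCausalLM

# Latency-oriented CPU setup; switch the hint to "THROUGHPUT" when
# batching many requests.
model = OVModelForCausalLM.from_pretrained(
    "dev-bjoern/smollm3-int4-ov",
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",
        "CACHE_DIR": "./ov_cache",  # reuse the compiled model across runs
    },
)
```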

## 🧪 Experimental Status

This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!

### Known Limitations
- No formal benchmarks yet
- Quantization settings not fully optimized
- Some quality degradation vs full precision

### Future Improvements
- [ ] Comprehensive benchmarking
- [ ] Mixed precision experiments
- [ ] Model compression analysis
- [ ] Calibration dataset optimization

## 🤝 Contributing

Found issues or have suggestions? Please open a discussion or issue!

## 📚 Resources

- [Original SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [OpenVINO Documentation](https://docs.openvino.ai/)
- [Optimum Intel](https://huggingface.co/docs/optimum/intel/index)

## 🙏 Acknowledgments

- Hugging Face team for SmolLM3
- Intel OpenVINO team for quantization tools
- Community for feedback and support

## 📝 Citation

If you use this model, please cite both the original and this work:

```bibtex
@misc{smollm3-int4-ov,
  author = {Bjoern Bethge},
  title = {SmolLM3 INT4 OpenVINO},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}
```

---

**Status:** 🧪 Experimental | **Feedback:** Welcome | **License:** Apache 2.0