---
library_name: transformers
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B-Base
tags:
- openvino
- int4
- quantization
- edge-deployment
- optimization
- smollm3
inference: false
---
# SmolLM3 INT4 OpenVINO
## 🚀 Optimized for Edge Deployment
This is an INT4 quantized version of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) using OpenVINO, designed for efficient inference on edge devices and CPUs.
## Model Overview
- **Base Model:** SmolLM3-3B
- **Quantization:** INT4 via OpenVINO
- **Size Reduction:** Weights roughly 4× smaller than FP16 (exact size depends on the compression settings)
- **Target Hardware:** CPUs, Intel GPUs, NPUs
- **Use Cases:** Local inference, edge deployment, resource-constrained environments
## 🔧 Technical Details
### Quantization Process
```python
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
```
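The exact export settings for this checkpoint are not documented here. As an illustrative sketch (not the recorded recipe), an INT4 weight-compression export with Optimum Intel could look like this:
```python
# Illustrative sketch only: a typical INT4 weight-compression export with Optimum Intel.
# Symmetric 4-bit weights and no calibration dataset are assumptions, not the recorded
# settings for this checkpoint.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(bits=4, sym=True)

model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,                       # convert the PyTorch checkpoint to OpenVINO IR
    quantization_config=quant_config,  # apply INT4 weight compression via NNCF
)
model.save_pretrained("smollm3-int4-ov")
```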
### Model Architecture
- Same architecture as SmolLM3-3B
- GQA and NoPE preserved
- 64k context support (up to 128k with YaRN)
- Multilingual capabilities maintained
## 📊 Performance (Experimental)
> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks pending.

Expected benefits of INT4 quantization:

- Reduced model size
- Faster CPU inference
- Lower memory requirements
- Some quality trade-off

Actual metrics will be added after proper benchmarking.
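Until formal numbers are available, a rough local sanity check (not a benchmark) can be run in a few lines; the prompt and token count below are arbitrary:
```python
# Rough tokens-per-second check; illustrative only, not a formal benchmark.
import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} tokens/s on this machine")
```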
## 🛠️ How to Use
### Installation
```bash
pip install "optimum[openvino]" transformers
```
### Basic Usage
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With Extended Thinking
```python
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Reuse the tokenizer and model loaded in the Basic Usage example
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎯 Intended Use
- **Edge AI applications**
- **Local LLM deployment**
- **Resource-constrained environments**
- **Privacy-focused applications**
- **Offline AI assistants**
## ⚡ Optimization Tips
1. **CPU Inference:** Use the OpenVINO runtime for best performance (a configuration sketch follows this list)
2. **Batch Processing:** Consider batching requests when possible
3. **Memory:** INT4 significantly reduces memory requirements
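As a minimal sketch of tip 1, OpenVINO runtime hints can be passed through `ov_config` when loading the model; `PERFORMANCE_HINT` and `CACHE_DIR` are standard OpenVINO properties, and the values below are just reasonable starting points:
```python
# Minimal sketch: passing OpenVINO runtime options via ov_config.
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(
    "dev-bjoern/smollm3-int4-ov",
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",  # favor single-request latency on CPU
        "CACHE_DIR": "ov_cache",        # cache the compiled model for faster reloads
    },
)
```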
## 🧪 Experimental Status
This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!
### Known Limitations
- No formal benchmarks yet
- Quantization settings not fully optimized
- Some quality degradation vs full precision
### Future Improvements
- [ ] Comprehensive benchmarking
- [ ] Mixed precision experiments
- [ ] Model compression analysis
- [ ] Calibration dataset optimization
## 🤝 Contributing
Found issues or have suggestions? Please open a discussion or issue!
## 📚 Resources
- [Original SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [OpenVINO Documentation](https://docs.openvino.ai/)
- [Optimum Intel](https://huggingface.co/docs/optimum/intel/index)
## 🙏 Acknowledgments
- Hugging Face team for SmolLM3
- Intel OpenVINO team for quantization tools
- Community for feedback and support
## 📝 Citation
If you use this model, please cite the original SmolLM3 release (see Resources above) as well as this repository:
```bibtex
@misc{smollm3-int4-ov,
  author       = {Bjoern Bethge},
  title        = {SmolLM3 INT4 OpenVINO},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}
```
---
**Status:** 🧪 Experimental | **Feedback:** Welcome | **License:** Apache 2.0