---
library_name: transformers
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
  - HuggingFaceTB/SmolLM3-3B-Base
tags:
- openvino
- int4
- quantization
- edge-deployment
- optimization
- smollm3
inference: false
---

# SmolLM3 INT4 OpenVINO

## 🚀 Optimized for Edge Deployment

This is an INT4-quantized version of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B), produced with OpenVINO and designed for efficient inference on edge devices and CPUs.

## Model Overview

- **Base Model:** SmolLM3-3B
- **Quantization:** INT4 via OpenVINO
- **Size Reduction:** INT4 weights occupy roughly a quarter of the storage of the FP16 originals (measured on-disk figures to follow)
- **Target Hardware:** CPUs, Intel GPUs, NPUs
- **Use Cases:** Local inference, edge deployment, resource-constrained environments

## 🔧 Technical Details

### Quantization Process
```python
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
```
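
For reference, the sketch below shows how an INT4 weight-compressed export of this kind is typically produced with Optimum Intel. The settings (`sym`, `group_size`, `ratio`) are illustrative defaults, not necessarily the ones used for this checkpoint:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Illustrative settings -- the exact configuration behind this checkpoint
# is not recorded here.
quantization_config = OVWeightQuantizationConfig(
    bits=4,          # INT4 weight compression
    sym=True,        # symmetric quantization
    group_size=128,  # shared scale per group of 128 weights
    ratio=0.8,       # fraction of layers compressed to INT4 (the rest stay INT8)
)

model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,  # convert the Transformers checkpoint to OpenVINO IR
    quantization_config=quantization_config,
)
model.save_pretrained("smollm3-int4-ov")
```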

### Model Architecture
- Same architecture as SmolLM3-3B
- GQA and NoPE preserved
- 64k context support (128k with YaRN)
- Multilingual capabilities maintained

## 📊 Performance (Experimental)

> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks pending.

Expected benefits of INT4 quantization:
- Reduced model size
- Faster CPU inference
- Lower memory requirements
- Some quality trade-off

Actual metrics will be added after proper benchmarking.

## 🛠️ How to Use

### Installation
```bash
pip install "optimum[openvino]" transformers
```

### Basic Usage
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
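
Until formal numbers are posted, a rough (and unscientific) timing loop gives a feel for local throughput. It reuses `model` and `tokenizer` from the snippet above:

```python
import time

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```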

### With Extended Thinking
```python
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]

# Build the prompt, then generate with the model and tokenizer from above
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
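
SmolLM3 also accepts `/no_think` in the system prompt to skip the reasoning trace when you only want the final answer.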

## 🎯 Intended Use

- **Edge AI applications**
- **Local LLM deployment**
- **Resource-constrained environments**
- **Privacy-focused applications**
- **Offline AI assistants**

## ⚡ Optimization Tips

1. **CPU Inference:** Use the OpenVINO runtime for best performance (see the sketch after this list)
2. **Batch Processing:** Consider batching requests when possible
3. **Memory:** INT4 significantly reduces memory requirements
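
As a concrete example of tips 1 and 2, OpenVINO runtime options can be passed via `ov_config` when loading the model. `PERFORMANCE_HINT` and `CACHE_DIR` are standard OpenVINO properties; the values here are illustrative:

```python
from optimum.intel import OVModelForCausalLM

# Latency-oriented CPU setup; switch the hint to "THROUGHPUT" when
# batching many requests.
model = OVModelForCausalLM.from_pretrained(
    "dev-bjoern/smollm3-int4-ov",
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",
        "CACHE_DIR": "./ov_cache",  # reuse the compiled model across runs
    },
)
```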

## 🧪 Experimental Status

This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!

### Known Limitations
- No formal benchmarks yet
- Quantization settings not fully optimized
- Some quality degradation vs full precision

### Future Improvements
- [ ] Comprehensive benchmarking
- [ ] Mixed precision experiments
- [ ] Model compression analysis
- [ ] Calibration dataset optimization

## 🤝 Contributing

Found issues or have suggestions? Please open a discussion or issue!

## 📚 Resources

- [Original SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [OpenVINO Documentation](https://docs.openvino.ai/)
- [Optimum Intel](https://huggingface.co/docs/optimum/intel/index)

## 🙏 Acknowledgments

- Hugging Face team for SmolLM3
- Intel OpenVINO team for quantization tools
- Community for feedback and support

## 📝 Citation

If you use this model, please cite both the original and this work:

```bibtex
@misc{smollm3-int4-ov,
  author = {Bjoern Bethge},
  title = {SmolLM3 INT4 OpenVINO},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}
```

---

**Status:** 🧪 Experimental | **Feedback:** Welcome | **License:** Apache 2.0