---
library_name: transformers
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B-Base
tags:
- openvino
- int4
- quantization
- edge-deployment
- optimization
- smollm3
inference: false
---
# SmolLM3 INT4 OpenVINO
## 🚀 Optimized for Edge Deployment
This is an INT4 quantized version of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) using OpenVINO, designed for efficient inference on edge devices and CPUs.
## Model Overview
- **Base Model:** SmolLM3-3B
- **Quantization:** INT4 via OpenVINO
- **Size Reduction:** Weights roughly 4× smaller than FP16 (exact size depends on the compression settings)
- **Target Hardware:** CPUs, Intel GPUs, NPUs
- **Use Cases:** Local inference, edge deployment, resource-constrained environments
## 🔧 Technical Details
### Quantization Process
```python
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
```
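The exact export settings for this checkpoint are not documented here. As an illustrative sketch (not the recorded recipe), an INT4 weight-compression export with Optimum Intel could look like this:
```python
# Illustrative sketch only: a typical INT4 weight-compression export with Optimum Intel.
# Symmetric 4-bit weights and no calibration dataset are assumptions, not the recorded
# settings for this checkpoint.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(bits=4, sym=True)

model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    export=True,                       # convert the PyTorch checkpoint to OpenVINO IR
    quantization_config=quant_config,  # apply INT4 weight compression via NNCF
)
model.save_pretrained("smollm3-int4-ov")
```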
### Model Architecture
- Same architecture as SmolLM3-3B
- GQA and NoPE preserved
- 64k context support (up to 128k with YaRN)
- Multilingual capabilities maintained
## 📊 Performance (Experimental)
> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks pending.

Expected benefits of INT4 quantization:

- Reduced model size
- Faster CPU inference
- Lower memory requirements
- Some quality trade-off

Actual metrics will be added after proper benchmarking.
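Until formal numbers are available, a rough local sanity check (not a benchmark) can be run in a few lines; the prompt and token count below are arbitrary:
```python
# Rough tokens-per-second check; illustrative only, not a formal benchmark.
import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} tokens/s on this machine")
```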
## 🛠️ How to Use
### Installation
```bash
pip install "optimum[openvino]" transformers
```
### Basic Usage
```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With Extended Thinking
```python
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Reuse the tokenizer and model loaded in the Basic Usage example
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎯 Intended Use
- **Edge AI applications**
- **Local LLM deployment**
- **Resource-constrained environments**
- **Privacy-focused applications**
- **Offline AI assistants**
## ⚡ Optimization Tips
1. **CPU Inference:** Use the OpenVINO runtime for best performance (a configuration sketch follows this list)
2. **Batch Processing:** Consider batching requests when possible
3. **Memory:** INT4 significantly reduces memory requirements
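As a minimal sketch of tip 1, OpenVINO runtime hints can be passed through `ov_config` when loading the model; `PERFORMANCE_HINT` and `CACHE_DIR` are standard OpenVINO properties, and the values below are just reasonable starting points:
```python
# Minimal sketch: passing OpenVINO runtime options via ov_config.
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(
    "dev-bjoern/smollm3-int4-ov",
    ov_config={
        "PERFORMANCE_HINT": "LATENCY",  # favor single-request latency on CPU
        "CACHE_DIR": "ov_cache",        # cache the compiled model for faster reloads
    },
)
```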
## 🧪 Experimental Status
This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!
### Known Limitations
- No formal benchmarks yet
- Quantization settings not fully optimized
- Some quality degradation vs full precision
### Future Improvements
- [ ] Comprehensive benchmarking
- [ ] Mixed precision experiments
- [ ] Model compression analysis
- [ ] Calibration dataset optimization
## 🤝 Contributing
Found issues or have suggestions? Please open a discussion or issue!
## 📚 Resources
- [Original SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [OpenVINO Documentation](https://docs.openvino.ai/)
- [Optimum Intel](https://huggingface.co/docs/optimum/intel/index)
## 🙏 Acknowledgments
- Hugging Face team for SmolLM3
- Intel OpenVINO team for quantization tools
- Community for feedback and support
## 📝 Citation
If you use this model, please cite the original SmolLM3 release (see Resources above) as well as this repository:
```bibtex
@misc{smollm3-int4-ov,
  author       = {Bjoern Bethge},
  title        = {SmolLM3 INT4 OpenVINO},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}
```
---
**Status:** 🧪 Experimental | **Feedback:** Welcome | **License:** Apache 2.0