# SmolLM2-360M - DPO

## Model Description

This model is a LoRA adapter fine-tuned from HuggingFaceTB/SmolLM2-360M using Direct Preference Optimization (DPO): paired preference optimization that trains the policy directly on preference data, without an explicit reward model.

This model was developed as part of thesis research on LLM alignment using preference optimization methods.
## Model Details
| Property | Value |
|---|---|
| Base Model | HuggingFaceTB/SmolLM2-360M |
| Training Method | DPO |
| Model Type | LoRA Adapter |
| Training Date | December 2025 |
| Framework | PyTorch + Transformers + PEFT |
## Benchmark Results
| Benchmark | Score |
|---|---|
| HellaSwag (10-shot) | 0.550 |
| TruthfulQA (0-shot MC2) | 0.361 |
| MMLU-Mini (5-shot) | 0.264 |
## Comparative Analysis

A chart comparing this method against other training approaches on the same base model is included in the repository at `thesis_plots/benchmark_results.png`.
## Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| Max Sequence Length | 512 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Dataset | Combined Preference Dataset (HH-RLHF + SHP + OpenAssistant) |
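Two derived values in the table can be checked directly: the effective batch size is the per-device batch size times the gradient accumulation steps, and the scaling factor LoRA applies to the adapter update is alpha divided by rank. A quick sanity check (not training code):

```python
# Effective batch size = per-device batch size x gradient accumulation steps
batch_size = 2
grad_accum_steps = 8
effective_batch_size = batch_size * grad_accum_steps
print(effective_batch_size)  # 16, matching the table

# LoRA scales the low-rank update BAx by alpha / rank
lora_rank = 16
lora_alpha = 32
lora_scaling = lora_alpha / lora_rank
print(lora_scaling)  # 2.0
```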
### Combined Preference Dataset (kto_combined)

Training uses a combined preference dataset built via round-robin sampling from three sources:
| Source | Total Samples | Interactions |
|---|---|---|
| Anthropic HH-RLHF | 321,600 | 61,568 |
| Stanford Human Preferences (SHP) | 697,436 | 38,984 |
| OpenAssistant Conversations v1 | 16,810 | 8,904 |
| Total | 1,035,846 | 109,456 |
Actual training statistics (subset split `train_prefs[:32090]`):
- Training samples: 13,300 (paired examples)
- Validation samples: 700 (5%)
- Round-Robin distribution: 1,130 interactions per source
- Seed: 42 (for reproducibility)
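Round-robin sampling draws one example from each source per pass, so every source contributes the same number of interactions to the mix. A simplified sketch of the idea (illustrative only; the source lists and counts here are placeholders, not the actual pipeline):

```python
def round_robin_sample(sources, n_per_source):
    """Interleave examples from several datasets, taking one example
    from each source in turn until n_per_source are drawn from each."""
    iters = [iter(src) for src in sources]
    out = []
    for _ in range(n_per_source):
        for it in iters:
            out.append(next(it))
    return out

# Placeholder stand-ins for HH-RLHF, SHP, and OpenAssistant examples
hh = [("hh", i) for i in range(5)]
shp = [("shp", i) for i in range(5)]
oasst = [("oasst", i) for i in range(5)]

mixed = round_robin_sample([hh, shp, oasst], n_per_source=3)
print(mixed[:3])  # one example from each source per round
```

Each round contributes exactly one example per source, which keeps the mixture balanced regardless of the raw dataset sizes.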
## Usage

### Loading as LoRA Adapter

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Nishef/SmolLM2-360M-Full_DPO_20251225_043457")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
## Training Methodology

### DPO

Direct Preference Optimization: paired preference optimization that trains the policy directly on preference data, without an explicit reward model.

Key features:
- Paired preference optimization on (chosen, rejected) response pairs
- Direct policy optimization without a separately trained reward model
- Efficient single-stage training
- Bradley-Terry preference modeling
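Under the Bradley-Terry model, DPO minimizes the negative log-sigmoid of a margin between implicit rewards: for each pair, loss = -log σ(β · [(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]). A minimal per-pair sketch of this objective (an illustration of the math, not the training code used here; β = 0.1 is a commonly used default, not necessarily this run's value):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model;
    beta scales the implicit reward.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written stably
    return math.log1p(math.exp(-margin))

# When policy and reference agree exactly, the margin is 0 and loss = log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

As the policy assigns relatively more probability to the chosen response than the reference does, the margin grows and the loss falls below log(2).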
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{smollm2_360m_dpo_2025,
  title     = {SmolLM2-360M Fine-tuned with DPO},
  author    = {Thesis Research},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Nishef/SmolLM2-360M-Full_DPO_20251225_043457}
}
```
## Repository Structure

```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # Model weights
├── tokenizer files            # Tokenizer configuration
├── eval_summary.csv           # Evaluation results
├── thesis_plots/              # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                  # This file
```
## Acknowledgments
- Base Model: HuggingFaceTB/SmolLM2-360M
- Training Framework: Hugging Face Transformers
- Fine-tuning Library: PEFT
## License

This model is released under the Apache 2.0 license.