update evaluation section and remove extra space
README.md CHANGED
@@ -164,7 +164,7 @@ You can specify custom instruction through the system prompt while controlling w
```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
-    {"role": "system", "content": "Speak like a pirate
+    {"role": "system", "content": "Speak like a pirate./think"},
    {"role": "user", "content": prompt}
]

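For context beyond the hunk above, which only shows the edited `messages` list, here is a minimal sketch of how that `/think` system-prompt flag is typically wired into a full `transformers` generation call. It is not part of this commit; the checkpoint id, device settings, and token budget are assumptions.

```python
# Hedged sketch, not part of the commit: run the pirate "/think" prompt end to end.
# The checkpoint id "HuggingFaceTB/SmolLM3-3B" and the generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    # "/think" asks for extended thinking; the surrounding README section describes a
    # "/no_think" flag to disable it.
    {"role": "system", "content": "Speak like a pirate./think"},
    {"role": "user", "content": prompt},
]

# Render the chat template and generate; with thinking enabled the model emits a
# reasoning trace before the final answer.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```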
@@ -179,11 +179,42 @@ For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can f

## Evaluation

-In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
+In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.

We highlight the best score in bold and underline the second-best score.

+## Instruction Model
+
+### No Extended Thinking
+Evaluation results of non-reasoning models and reasoning models in no-thinking mode. We highlight the best score in bold and underline the second-best score.
+| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|-------------|------------|----------|
+| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
+| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
+| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
+| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
+| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
+| Tool calling | BFCL | <u>92.3</u> | - | <u>92.3</u>* | 89.5 | **95.0** |
+| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
+(*): this is a tool-calling fine-tune
+
+### Extended Thinking
+Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
+| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|----------|
+| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
+| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
+| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
+| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
+| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
+| Tool calling | BFCL | <u>88.8</u> | <u>88.8</u> | **95.5** |
+| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
+
+
## Base Pre-Trained Model
+For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.

### English benchmarks
Note: All evaluations are zero-shot unless stated otherwise.
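The hunk above also adds the note "For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length." As a rough illustration of what that usually involves (not the exact setup behind these numbers), YaRN-style RoPE scaling can be switched on by overriding a checkpoint's `rope_scaling` config before loading it; the checkpoint id, scaling factor, and context sizes below are assumptions.

```python
# Hedged sketch: extend a 32k-context model to roughly 64k with YaRN-style RoPE scaling.
# The checkpoint id, factor, and context lengths are illustrative assumptions, not the
# configuration used for the numbers in this README.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-4B-Base"  # assumed id; any RoPE model with YaRN support works similarly
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",                        # YaRN position interpolation
    "factor": 2.0,                              # 32k native context * 2 is roughly the 64k target
    "original_max_position_embeddings": 32768,  # pretraining context length
}
config.max_position_embeddings = 65536

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```

The added line applies this only to the Qwen baselines with 32k context; SmolLM3 itself is evaluated without such an override.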
@@ -212,7 +243,6 @@ Note: All evaluations are zero-shot unless stated otherwise.
### Multilingual benchmarks


-
| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|---------------------|------------|--------------|------------------|---------------|
| Main supported languages | | | | | | |
@@ -251,35 +281,6 @@ The model has also been trained on Arabic (standard), Chinese and Russian data,
| | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |

-
-## Instruction Model
-
-### No Extended Thinking
-Evaluation results of non-reasoning models and reasoning models in no-thinking mode. We highlight the best score in bold and underline the second-best score.
-| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|-------------|------------|----------|
-| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
-| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
-| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
-| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
-| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
-| Knowledge | MMLU-Pro | 45.0 | 41.9 | 36.6 | <u>45.6</u> | **60.9** |
-| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
-
-### Extended Thinking
-Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
-| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|----------|
-| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
-| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
-| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
-| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
-| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
-| Knowledge | MMLU-Pro | <u>58.4</u> | 57.8 | **70.2** |
-| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
-
## Training

### Model