Commit 1b310b4 (parent 6e4efe2) by cmpatino: Add examples for using the model

Files changed: README.md (+118 −43)
Removed in this commit: the old "### How to use" heading and the previous basic usage snippet, which read:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

The standalone "## Agentic Usage" section was also removed; its content now lives under "### Agentic Usage" inside the new "## How to use" section shown below.

The relevant sections of README.md after this commit:
 
 
For more details refer to our blog post: TODO

## How to use

The modeling code for SmolLM3 is available in transformers `v4.53.0`, so make sure to upgrade your transformers version. You can also load the model with the latest `vllm`, which uses transformers as a backend.

```bash
pip install -U transformers
```
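If you prefer the `vllm` route, here is a minimal offline-inference sketch (assuming a recent vLLM release; the sampling settings are illustrative, not recommendations):

```python
from vllm import LLM, SamplingParams

# load the checkpoint with vLLM, which uses the transformers modeling code as a backend
llm = LLM(model="HuggingFaceTB/SmolLM3-3B")
params = SamplingParams(temperature=0.6, max_tokens=256)

# plain text completion; for chat-style prompts, apply the chat template first
outputs = llm.generate(["Gravity is"], params)
print(outputs[0].outputs[0].text)
```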
 
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the output
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# decode only the newly generated tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
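For multi-GPU setups, the previous version of this snippet suggested installing `accelerate` and letting transformers place the weights automatically instead of calling `.to(device)`; a minimal sketch:

```python
# requires `pip install accelerate`; shards the model across available GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```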

### Enabling and Disabling Extended Thinking Mode

We enable extended thinking by default, so the example above generates output with a reasoning trace. To choose between the two modes, provide the `/think` or `/no_think` flag through the system prompt, as shown in the snippet below for disabling extended thinking. The code for generating a response with extended thinking would be the same, except that the system prompt should contain `/think` instead of `/no_think`.

```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```
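For reference, the only change needed to keep extended thinking enabled is the flag in the system message:

```python
# same as above, but with extended thinking enabled
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": prompt}
]
```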

We also provide the option of specifying whether to use extended thinking through the `enable_thinking` kwarg, as in the example below. You do not need to set the `/no_think` or `/think` flags through the system prompt if you use the kwarg, but keep in mind that a flag in the system prompt overrides the kwarg setting.

```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
```
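In both cases, tokenization and generation then proceed exactly as in the basic example above:

```python
# tokenize the templated text and generate as before
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```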

### Agentic Usage

SmolLM3 supports tool calling! Just pass your list of tools under the argument `xml_tools` (for standard tool calling) or `python_tools` (for calling tools like Python functions in a `<code>` snippet).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

tools = [
    {
        "name": "get_weather",
        "description": "Get the weather in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city to get the weather for"}
            }
        }
    }
]

messages = [
    {
        "role": "user",
        "content": "Hello! How is the weather today in Copenhagen?"
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    enable_thinking=False,  # True works as well, your choice!
    xml_tools=tools,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
)

outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
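Note that `model.generate` is called with its default generation length here; in practice you may want to allow more new tokens and decode only the newly generated part, for example (a sketch reusing the variables above; 512 is an arbitrary budget):

```python
# allow enough room for the tool call and decode only the new tokens
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```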
 
### Using Custom System Instructions

You can specify custom instructions through the system prompt while controlling whether to use extended thinking. For example, the snippet below shows how to make the model speak like a pirate while keeping extended thinking enabled.

```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "system", "content": "Speak like a pirate. /think"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```

For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can find quantized checkpoints in this collection [TODO].
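As one sketch of the `llama.cpp` route, a GGUF quant of the model can be run through the `llama-cpp-python` bindings; the filename below is hypothetical, so substitute whichever quantized checkpoint you download from the collection:

```python
from llama_cpp import Llama

# hypothetical GGUF filename; point this at your downloaded quant
llm = Llama(model_path="SmolLM3-3B-Q4_K_M.gguf", n_ctx=4096)
out = llm("Give me a brief explanation of gravity in simple terms.", max_tokens=256)
print(out["choices"][0]["text"])
```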
 
## Evaluation

[...]

Here is an infographic with all the training details [TODO].

- The datasets used for pretraining can be found in this [collection](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) and those used in mid-training and post-training can be found here [TODO]
- The training and evaluation configs and code can be found in the [huggingface/smollm](https://github.com/huggingface/smollm) repository.

## Limitations