Commit 1b310b4 (parent 6e4efe2) by cmpatino: Add examples for using the model

Files changed: README.md (+118 −43)
Removed in this commit: the old "### How to use" heading and the previous basic usage snippet, which read:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

The standalone "## Agentic Usage" section was also removed; its content now lives under "### Agentic Usage" inside the new "## How to use" section shown below.

The relevant sections of README.md after this commit:
 
 
For more details refer to our blog post: TODO

## How to use

The modeling code for SmolLM3 is available in transformers `v4.53.0`, so make sure to upgrade your transformers version. You can also load the model with the latest `vllm`, which uses transformers as a backend.

```bash
pip install -U transformers
```
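If you prefer the `vllm` route, here is a minimal offline-inference sketch (assuming a recent vLLM release; the sampling settings are illustrative, not recommendations):

```python
from vllm import LLM, SamplingParams

# load the checkpoint with vLLM, which uses the transformers modeling code as a backend
llm = LLM(model="HuggingFaceTB/SmolLM3-3B")
params = SamplingParams(temperature=0.6, max_tokens=256)

# plain text completion; for chat-style prompts, apply the chat template first
outputs = llm.generate(["Gravity is"], params)
print(outputs[0].outputs[0].text)
```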
 
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the output
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# decode only the newly generated tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
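For multi-GPU setups, the previous version of this snippet suggested installing `accelerate` and letting transformers place the weights automatically instead of calling `.to(device)`; a minimal sketch:

```python
# requires `pip install accelerate`; shards the model across available GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```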

### Enabling and Disabling Extended Thinking Mode

We enable extended thinking by default, so the example above generates output with a reasoning trace. To choose between the two modes, provide the `/think` or `/no_think` flag through the system prompt, as shown in the snippet below for disabling extended thinking. The code for generating a response with extended thinking would be the same, except that the system prompt should contain `/think` instead of `/no_think`.

```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```
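For reference, the only change needed to keep extended thinking enabled is the flag in the system message:

```python
# same as above, but with extended thinking enabled
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": prompt}
]
```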

We also provide the option of specifying whether to use extended thinking through the `enable_thinking` kwarg, as in the example below. You do not need to set the `/no_think` or `/think` flags through the system prompt if you use the kwarg, but keep in mind that a flag in the system prompt overrides the kwarg setting.

```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
```
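In both cases, tokenization and generation then proceed exactly as in the basic example above:

```python
# tokenize the templated text and generate as before
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```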

### Agentic Usage

SmolLM3 supports tool calling! Just pass your list of tools under the argument `xml_tools` (for standard tool calling) or `python_tools` (for calling tools like Python functions in a `<code>` snippet).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

tools = [
    {
        "name": "get_weather",
        "description": "Get the weather in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city to get the weather for"}
            }
        }
    }
]

messages = [
    {
        "role": "user",
        "content": "Hello! How is the weather today in Copenhagen?"
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    enable_thinking=False,  # True works as well, your choice!
    xml_tools=tools,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
)

outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
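Note that `model.generate` is called with its default generation length here; in practice you may want to allow more new tokens and decode only the newly generated part, for example (a sketch reusing the variables above; 512 is an arbitrary budget):

```python
# allow enough room for the tool call and decode only the new tokens
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```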
 
### Using Custom System Instructions

You can specify custom instructions through the system prompt while controlling whether to use extended thinking. For example, the snippet below shows how to make the model speak like a pirate while keeping extended thinking enabled.

```python
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "system", "content": "Speak like a pirate. /think"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```

For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can find quantized checkpoints in this collection [TODO].
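As one sketch of the `llama.cpp` route, a GGUF quant of the model can be run through the `llama-cpp-python` bindings; the filename below is hypothetical, so substitute whichever quantized checkpoint you download from the collection:

```python
from llama_cpp import Llama

# hypothetical GGUF filename; point this at your downloaded quant
llm = Llama(model_path="SmolLM3-3B-Q4_K_M.gguf", n_ctx=4096)
out = llm("Give me a brief explanation of gravity in simple terms.", max_tokens=256)
print(out["choices"][0]["text"])
```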
 
## Evaluation

[...]

Here is an infographic with all the training details [TODO].

- The datasets used for pretraining can be found in this [collection](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) and those used in mid-training and post-training can be found here [TODO]
- The training and evaluation configs and code can be found in the [huggingface/smollm](https://github.com/huggingface/smollm) repository.

## Limitations