short sft config

#8
by zhengwenzhen - opened

I cannot find the training or data config about the "short sft" in https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm3. Is there any information about the data recipe or the total number of training tokens?

Hugging Face Smol Models Research org

Hi @zhengwenzhen ! All the subsets without reasoning (i.e. "short") are denoted by a _no_think suffix in the data mixer here: https://github.com/huggingface/alignment-handbook/blob/main/recipes/smollm3/sft/sft.yaml
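For reference, here is a minimal sketch of how one could pull out just the non-reasoning subsets from that config. The `dataset_mixer` key and the exact YAML layout are assumptions on my part, so check the actual sft.yaml in the alignment-handbook repo:

```python
# Minimal sketch (not the actual recipe code): load the SFT config and keep only
# the non-reasoning ("_no_think") subsets of the data mixer. The YAML structure
# here is assumed.
import yaml

with open("recipes/smollm3/sft/sft.yaml") as f:
    config = yaml.safe_load(f)

# Assuming the mixer is a mapping of subset names to sampling weights.
mixer = config.get("dataset_mixer", {})
no_think_mixer = {
    name: weight for name, weight in mixer.items() if name.endswith("_no_think")
}

print(f"{len(no_think_mixer)} non-reasoning subsets out of {len(mixer)} total")
```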

Thanks! I'm referring to the 'short SFT' phase after the long-context training used for the model soup merge. Specifically, for the SFT data used to train from stage3-step-4720000 to it-LC-expert, did you mean that this data follows the recipe in sft.yaml, but is restricted to the subset with the _no_think suffix? Could you also share the total number of samples (or tokens) involved in this specific phase? Thanks in advance!

@zhengwenzhen , I found the following from the blog. It seems like they did SFT directly on the mid-trained checkpoint using non-thinking SFT datasets, only this time on a pre-trained long-context one.

"Combine the model soup with a mid-training checkpoint that has strong long-content performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and mid-training checkpoint, respectively, achieved the best performance. We were able to recover the base model’s RULER score on contexts up to 128k tokens."
