short sft config

#8
by zhengwenzhen - opened

I cannot find the training or data config about the "short sft" in https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm3. Is there any information about the data recipe or the total number of training tokens?

Hugging Face Smol Models Research org

Hi @zhengwenzhen ! All the subsets without reasoning (i.e. "short") are denoted by a _no_think suffix in the data mixer here: https://github.com/huggingface/alignment-handbook/blob/main/recipes/smollm3/sft/sft.yaml
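For reference, here is a minimal sketch of how one could pull out just the non-reasoning subsets from that config. The `dataset_mixer` key and the exact YAML layout are assumptions on my part, so check the actual sft.yaml in the alignment-handbook repo:

```python
# Minimal sketch (not the actual recipe code): load the SFT config and keep only
# the non-reasoning ("_no_think") subsets of the data mixer. The YAML structure
# here is assumed.
import yaml

with open("recipes/smollm3/sft/sft.yaml") as f:
    config = yaml.safe_load(f)

# Assuming the mixer is a mapping of subset names to sampling weights.
mixer = config.get("dataset_mixer", {})
no_think_mixer = {
    name: weight for name, weight in mixer.items() if name.endswith("_no_think")
}

print(f"{len(no_think_mixer)} non-reasoning subsets out of {len(mixer)} total")
```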

Thanks! I'm referring to the 'short SFT' phase after the long-context training used for the model soup merge. Specifically, for the SFT data used to train from stage3-step-4720000 to it-LC-expert, did you mean that this data follows the recipe in sft.yaml, but is restricted to the subset with the _no_think suffix? Could you also share the total number of samples (or tokens) involved in this specific phase? Thanks in advance!

@zhengwenzhen , I found the following from the blog. It seems like they did SFT directly on the mid-trained checkpoint using non-thinking SFT datasets, only this time on a pre-trained long-context one.

"Combine the model soup with a mid-training checkpoint that has strong long-content performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and mid-training checkpoint, respectively, achieved the best performance. We were able to recover the base model’s RULER score on contexts up to 128k tokens."
