Short SFT config
I cannot find the training or data config for the "short SFT" in https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm3. Is there any information about the data recipe or the total number of training tokens?
Hi @zhengwenzhen! All the subsets without reasoning (i.e. "short") are denoted by a _no_think suffix in the data mixer here: https://github.com/huggingface/alignment-handbook/blob/main/recipes/smollm3/sft/sft.yaml
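For anyone who wants to inspect those subsets programmatically, here is a minimal sketch. It assumes the "short" subsets are exposed as dataset configs whose names end in _no_think, and the dataset id below (HuggingFaceTB/smoltalk2) is my assumption from the SmolLM3 release, so substitute whatever sft.yaml actually references.

```python
# Sketch: enumerate the non-reasoning ("_no_think") subsets of the SFT mixture.
# Assumptions: HuggingFaceTB/smoltalk2 is the dataset referenced in sft.yaml and
# the "short" subsets are exposed as configs ending in "_no_think".
from datasets import get_dataset_config_names

configs = get_dataset_config_names("HuggingFaceTB/smoltalk2")
no_think_subsets = [name for name in configs if name.endswith("_no_think")]

print(f"{len(no_think_subsets)} non-reasoning subsets:")
for name in sorted(no_think_subsets):
    print("  ", name)
```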
Thanks! I'm referring to the 'short SFT' phase after the long-context training, used for the model soup merge. Specifically, for the SFT data used to train from stage3-step-4720000 to it-LC-expert, did you mean that this data follows the recipe in sft.yaml but is restricted to the subsets with the _no_think suffix? Could you also share the total number of samples (or tokens) involved in this specific phase? Thanks in advance!
@zhengwenzhen, I found the following in the blog. It seems like they did SFT directly on the mid-trained checkpoint using non-thinking SFT datasets, not just on a pre-trained long-context one.
"Combine the model soup with a mid-training checkpoint that has strong long-content performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and mid-training checkpoint, respectively, achieved the best performance. We were able to recover the base model’s RULER score on contexts up to 128k tokens."