long outputs
Could you comment on the issue of the long outputs generated by the latest reasoning models? Are they expected to produce thousands of tokens for each prompt?
Yes, these models are expected to think for many tokens before finalizing the answer. We recommend allowing up to 64K output tokens. It should be possible to make them more token-efficient, or even add a controllable token budget, with a separate round of RL, but we haven't done that yet.
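For reference, with a standard Hugging Face transformers setup the output budget can be raised along these lines (a minimal sketch; the model id is a placeholder, not a specific checkpoint):

```python
# Minimal sketch of allowing a long reasoning trace by raising the output budget.
# "your-org/your-reasoning-model" is a placeholder, not a released checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-reasoning-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Allow up to 64K new tokens; the model stops earlier once it emits its EOS token.
outputs = model.generate(inputs, max_new_tokens=65536)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```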
I see that you changed the eos_token. Will it affect this behaviour?
No, it should only matter if you create a finetuned version of the model. The current models' behavior should stay the same.
Could you help clarify what impact this has on generation?
If the model was originally trained to emit 151643 as the eos token, but the runtime now expects 151645, wouldn't that cause a mismatch, where generation might not stop unless the new token happens to be emitted?
Does the model actually emit 151645 under current weights, or was it trained to use 151643? (which EOS token is actually the "correct" one from the model’s point of view?)
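One way I imagine checking this empirically (a rough sketch, assuming a transformers setup and a placeholder model id) would be to print the EOS ids the configs declare, then generate while stopping only on `<|endoftext|>` and look at the tail of the output:

```python
# Rough sketch: inspect which EOS the configs declare and which token(s) the
# model actually emits at the end. The model id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-reasoning-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

print("tokenizer.eos_token_id:", tokenizer.eos_token_id)
print("generation_config.eos_token_id:", model.generation_config.eos_token_id)

messages = [{"role": "user", "content": "Say hi."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stop only on <|endoftext|> (151643) so the natural ending is visible:
# if the model ends with <|im_end|>\n<|endoftext|>, the tail should read
# [..., 151645, 198, 151643].
out = model.generate(inputs, max_new_tokens=4096, eos_token_id=151643)
print("tail token ids:", out[0][-5:].tolist())
```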
Also, as I understand it, this change would require re-exporting the model to GGUF, since the llama.cpp converter reads these config files. So even though the model weights remain unchanged, a new GGUF would need to be generated to reflect the updated eos_token and its ID.
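If so, the re-export would presumably look roughly like this (a sketch; paths are placeholders, and the script name and flags assume a recent llama.cpp checkout where the converter is `convert_hf_to_gguf.py`):

```python
# Rough sketch of regenerating the GGUF so the updated eos_token_id from the
# HF config/tokenizer files is baked into the new file. Paths are placeholders,
# and the script name/flags assume a recent llama.cpp checkout.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/updated-hf-checkpoint",
        "--outfile", "model-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```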
The model would always end with <|im_end|>\n<|endoftext|>, which corresponds to [151645, 198, 151643]. So the new change just stops it two tokens earlier, which shouldn't really matter in most situations. But if you finetune this model on new data without <|endoftext|>, it will not stop properly without the PR we merged.
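If you want to be extra safe on the inference side, you can also pass both ids as stop tokens, e.g. with transformers (a sketch with a placeholder model id):

```python
# Sketch: treat both <|im_end|> (151645) and <|endoftext|> (151643) as stop
# tokens, so generation halts at <|im_end|> and the trailing "\n<|endoftext|>"
# (ids 198, 151643) is simply never produced. The model id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-reasoning-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Say hi."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=65536, eos_token_id=[151645, 151643])
print("last token id:", outputs[0][-1].item())  # expected 151645 (<|im_end|>)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```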
Thank you