LaSeR: Reinforcement Learning with Last-Token Self-Rewarding (Tencent Hunyuan), submitted by Wenkai Yang