MuToR: Multi-Token prediction with Registers

Arxiv: https://arxiv.org/abs/2505.10518

TL;DR: MuToR is a simple, plug-and-play approach to multi-token prediction. It interleaves dummy register tokens into the training sequence and uses them to predict multiple future targets, which enriches the supervisory signal and improves performance across diverse settings and modalities. The register tokens are discarded at inference time, so generation speed is unchanged.
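To make the idea of register tokens concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the `<reg>` placeholder, the function names, and the convention that a register placed after position `i` with offset `d` is supervised with token `i + d` are all assumptions made for illustration. Attention masking and position handling are deliberately left out; see the paper and the authors' code for the actual method.

```python
REG = "<reg>"  # hypothetical placeholder for a register token


def interleave_registers(tokens, offsets):
    """Insert a register after position i for every (i, d) pair in `offsets`.

    Returns (sequence, targets), where targets[k] is the token that
    position k is trained to predict, or None if no target exists.
    """
    sequence, targets = [], []
    for i, tok in enumerate(tokens):
        sequence.append(tok)
        # Regular token: the usual next-token target.
        targets.append(tokens[i + 1] if i + 1 < len(tokens) else None)
        d = offsets.get(i)
        if d is not None:
            sequence.append(REG)
            # Register: target lies d positions ahead (assumed convention).
            targets.append(tokens[i + d] if i + d < len(tokens) else None)
    return sequence, targets


def strip_registers(sequence):
    """Drop registers, as done at inference: the original sequence remains."""
    return [t for t in sequence if t != REG]


tokens = ["a", "b", "c", "d", "e"]
sequence, targets = interleave_registers(tokens, {0: 3, 2: 2})
print(sequence)  # ['a', '<reg>', 'b', 'c', '<reg>', 'd', 'e']
print(targets)   # ['b', 'd', 'c', 'd', 'e', 'e', None]
```

In this toy, the regular tokens keep their ordinary next-token targets, the registers add extra targets further ahead, and removing the registers recovers the original sequence.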

Model Description

This model is Llama 3 8B finetuned with the MuToR method for 5 epochs on the GSM8K training split. Please refer to our code for guidelines on using the models to reproduce our results.
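As a starting point, the checkpoint can probably be loaded like any other causal language model on the Hub, since the register tokens are not needed at inference. The sketch below assumes the repository works with the standard `transformers` auto classes and that `torch` and `transformers` are installed. The helper names `load_model` and `generate_answer` are hypothetical, and the prompt format used during finetuning is not specified here, so consult the authors' code before comparing against reported results.

```python
MODEL_ID = "nasos10/MuToR-llama3-8B-GSM8K-dmax_4_a_03"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and BF16 weights (about 16 GB for 8B parameters)."""
    # Imported lazily so the heavy dependencies are only needed when loading.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )
    model.eval()
    return tokenizer, model


def generate_answer(tokenizer, model, question: str, max_new_tokens: int = 256) -> str:
    """Greedy-decode a continuation of `question` and return only the new text."""
    import torch

    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Slice off the prompt tokens so only the generated part is decoded.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `tokenizer, model = load_model()` followed by `generate_answer(tokenizer, model, "...")` would run ordinary next-token generation with no register tokens involved.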


Safetensors

Model size: 8B params

Tensor type: BF16

Model tree for nasos10/MuToR-llama3-8B-GSM8K-dmax_4_a_03

Base model: meta-llama/Meta-Llama-3-8B
