MOTIF paper
- Paper link: arXiv preprint
- GitHub link: training and evaluation code
- Trained models: Hugging Face collection containing the MOTIF-trained and the vanilla GRPO-trained models compared in the paper
The INFTYTHINK architecture, shown below, enables multi-round thinking so that an LLM can reason beyond its context size.
In this work, we propose a GRPO-based training method for such a system: the accuracy reward is computed by rolling out full multi-round trajectories and applying the reward to the first-round inference outcomes. This is depicted in the figure below:
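As a complement to the figure, here is a rough conceptual sketch of how such a first-round accuracy reward could be computed during GRPO rollouts. The helper names below are hypothetical and this is not the released training code; see the GitHub repository linked above for the actual implementation.

```python
import re

def extract_boxed_answer(text):
    """Return the content of the last \\boxed{...} in a completion, or None."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def first_round_accuracy_reward(first_round_completions, gold_answer):
    """Binary accuracy reward for each first-round completion in a GRPO group.

    The full multi-round trajectory is rolled out, but the accuracy reward is
    credited to the first round of inference outcomes, as described above.
    """
    rewards = []
    for completion in first_round_completions:
        predicted = extract_boxed_answer(completion)
        rewards.append(1.0 if predicted == gold_answer else 0.0)
    return rewards

# Example: rewards for a group of sampled first-round responses to one question.
group = [
    "<reasoning>...</reasoning> <answer>The result is \\boxed{408}.</answer>",
    "<reasoning>...</reasoning> <answer>I get \\boxed{412}.</answer>",
]
print(first_round_accuracy_reward(group, "408"))  # -> [1.0, 0.0]
```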
Our results are shown in the figures below.

The trained model can be loaded as follows:
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the 4-bit quantized base model and attach the vanilla-GRPO-trained adapter.
base_model = AutoModelForCausalLM.from_pretrained("unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit")
model = PeftModel.from_pretrained(base_model, "purbeshmitra/vanillaGRPO")

# System prompt used at inference time.
SYSTEM_PROMPT = """You are a helpful assistant. When the user asks a question, you first think about the reasoning process in mind and then provide the user with an answer. The reasoning process and the answer are enclosed within <reasoning> </reasoning> and <answer> </answer> tags, respectively. In your answer, you also enclose your final answer in the box: \\boxed{}. Therefore, you respond in the following strict format:
<reasoning> reasoning process here </reasoning> <answer> answer here </answer>."""
```
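A short usage sketch follows. The tokenizer checkpoint and generation settings here are illustrative assumptions, not the paper's exact evaluation setup.

```python
from transformers import AutoTokenizer

# Tokenizer for the same base model (assumed; adjust if you use a different checkpoint).
tokenizer = AutoTokenizer.from_pretrained("unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 17 * 24?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Illustrative generation settings; decode only the newly generated tokens.
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```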
If you find our work useful, consider citing it as:
```bibtex
@article{mitra2025motif,
  title={MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs},
  author={Mitra, Purbesh and Ulukus, Sennur},
  journal={arXiv preprint arXiv:2507.02851},
  year={2025}
}
```