---
model-index:
- name: tulu-v2.5-13b-preference-mix-rm
  results: []
datasets:
- allenai/tulu-2.5-preference-data
- allenai/tulu-v2-sft-mixture
language:
- en
base_model: allenai/tulu-2-13b
license: apache-2.0
---
<center>
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-2.5/tulu_25_banner.png" alt="Tulu 2.5 banner image" width="800px"/>
</center>

# Model Card for Tulu V2.5 13B RM - Preference Mix

Tulu is a series of language models that are trained to act as helpful assistants.
Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tulu 2 suite](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
This is a reward model used during PPO training, trained on our preference data mixture.
It was used to train [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-13b-mix-rm) model.

For more details, read the paper:
[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).

## Model description

- **Model type:** One model belonging to a suite of RLHF-tuned chat models trained on a mix of publicly available, synthetic and human-created datasets.
- **Language(s) (NLP):** English
- **License:** Apache 2.0.
- **Finetuned from model:** [meta-llama/Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf)

### Model Sources

- **Repository:** https://github.com/allenai/open-instruct
- **Dataset:** Data used to train this model can be found [here](https://huggingface.co/datasets/allenai/tulu-2.5-preference-data) - specifically the `preference_big_mixture` split (a loading sketch follows this list).
- **Model Family:** The collection of related models can be found [here](https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).
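
As a quick way to inspect that data, here is a minimal sketch using the `datasets` library. The split name below simply restates the description above; treat it as an assumption and check the dataset card for the authoritative configuration and split names.

```python
from datasets import load_dataset

# Assumption: the preference mixture is exposed as a split named
# "preference_big_mixture"; see the dataset card for the exact layout.
prefs = load_dataset("allenai/tulu-2.5-preference-data", split="preference_big_mixture")
print(prefs[0])  # one preference example (prompt plus chosen/rejected responses)
```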
					
						
## Input Format

The model is trained to use the following format (note the newlines):
```
<|user|>
Your message here!
<|assistant|>
```

For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.**
We have included a [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) in the tokenizer implementing this format.
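
As an illustration, here is a minimal sketch that applies the chat template and scores a prompt/response pair with this reward model. Loading the checkpoint via `AutoModelForSequenceClassification` with a single-logit head is an assumption on our part; if it does not match this checkpoint's `config.json`, see the open-instruct repository for how the reward head is stored.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "allenai/tulu-v2.5-13b-preference-mix-rm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Assumption: the reward head is exposed as a one-label sequence-classification head.
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16)

messages = [
    {"role": "user", "content": "Your message here!"},
    {"role": "assistant", "content": "A candidate response to score."},
]
# The chat template inserts the <|user|>/<|assistant|> tags and the newlines described above.
text = tokenizer.apply_chat_template(messages, tokenize=False)

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits[0, 0].item()  # scalar reward for this prompt/response pair
print(reward)
```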
					
						
## Intended uses & limitations

The model was initially fine-tuned on a filtered and preprocessed version of the [Tulu V2 mix dataset](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture), which contains a diverse range of human-created instructions and synthetic dialogues generated primarily by other LLMs.
We then further trained the model with a [Jax RM trainer](https://github.com/hamishivi/EasyLM/blob/main/EasyLM/models/llama/llama_train_rm.py) built on [EasyLM](https://github.com/young-geng/EasyLM) on the preference data mixture mentioned above.
This model is meant as a research artefact.
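
The reward-modelling objective for this kind of training is the standard pairwise (Bradley-Terry style) loss over chosen/rejected pairs. Purely as an illustration (the actual implementation is the JAX trainer linked above), the loss looks roughly like this:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Standard pairwise reward-model loss.

    Both arguments hold one scalar reward per comparison, shape (batch,).
    The loss pushes the chosen response's reward above the rejected one's.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```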
					
						
### Training hyperparameters

The following hyperparameters were used during RM training (an illustrative configuration sketch follows the list):
					
						
- learning_rate: 1e-06
- total_train_batch_size: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear cooldown to 1e-05
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 1.0
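
The run itself used the JAX trainer above; purely as an approximate illustration, the same settings map onto a `transformers`-style configuration as below. The per-device batch size and accumulation steps are placeholders chosen to reach the 512 total, and the non-zero end learning rate listed above has no direct equivalent here.

```python
from transformers import TrainingArguments

# Approximate restatement of the hyperparameters above for a transformers-style
# trainer (not the actual EasyLM/JAX configuration used for this model).
args = TrainingArguments(
    output_dir="tulu-v2.5-13b-rm",      # placeholder path
    learning_rate=1e-6,
    per_device_train_batch_size=8,      # placeholder; together with accumulation
    gradient_accumulation_steps=64,     # this reaches the 512 total batch size
    lr_scheduler_type="linear",         # "linear" here decays to 0, not to 1e-05
    warmup_ratio=0.03,
    num_train_epochs=1.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```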
					
						
## Citation

If you find Tulu 2.5 useful in your work, please cite it with:
					
						
```
@misc{ivison2024unpacking,
    title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
    author={Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
    year={2024},
    eprint={2406.09279},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```