---
license: apache-2.0
base_model: Qwen/Qwen2-1.5B-Instruct
tags:
- alignment-handbook
- trl
- dpo
- generated_from_trainer
datasets:
- princeton-nlp/llama3-ultrafeedback
model-index:
- name: qwen2-1.5b-instruct-simpo-lr-5e-07-gamma-1.5
  results: []
---
# qwen2-1.5b-instruct-simpo-lr-5e-07-gamma-1.5

## Description

This model was trained as part of the Reinforcement Learning (2024) course project at Peking University, focusing on SimPO (Simple Preference Optimization).

## Authors

- Ejafa Bassam
- Yaroslav Ponomarenko
This model is a fine-tuned version of [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) on the princeton-nlp/llama3-ultrafeedback dataset.
It achieves the following results on the evaluation set:
- Loss: 1.6346
- Rewards/chosen: -2.6152
- Rewards/rejected: -2.7999
- Rewards/accuracies: 0.5685
- Rewards/margins: 0.1847
- Logps/rejected: -1.1200
- Logps/chosen: -1.0461
- Logits/rejected: -1.5578
- Logits/chosen: -1.5356
## Model description

This is a preference-optimized variant of [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), trained with SimPO, a reference-free preference-optimization method that scores responses by their length-normalized average log-probability and enforces a target reward margin gamma between chosen and rejected responses (gamma = 1.5 here, per the model name). A minimal sketch of the objective follows.
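The snippet below is a minimal PyTorch sketch of the SimPO loss, not the exact training code: `beta` is a placeholder (the card does not record its value), while `gamma = 1.5` is taken from the model name. The length normalization matches the per-token `Logps/*` values reported above.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps,
               chosen_lengths, rejected_lengths,
               beta=2.0, gamma=1.5):
    """SimPO: reference-free, length-normalized preference loss.

    chosen_logps / rejected_logps: summed token log-probs of each
    response under the policy, shape (batch,).
    chosen_lengths / rejected_lengths: response token counts, shape (batch,).
    beta is an assumed placeholder; gamma=1.5 comes from the model name.
    """
    # Average per-token log-probability acts as the implicit reward.
    chosen_rewards = beta * chosen_logps / chosen_lengths
    rejected_rewards = beta * rejected_logps / rejected_lengths
    # Bradley-Terry-style logistic loss with a target reward margin gamma.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```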
					
						
## Intended uses & limitations

More information needed. The checkpoint inherits the chat format of its Qwen2-1.5B-Instruct base, so it can be queried as sketched below.
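A minimal inference sketch, assuming the checkpoint is published on the Hugging Face Hub; the repo id below is a placeholder, so substitute the actual one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace with the actual location of this checkpoint.
model_id = "<your-org>/qwen2-1.5b-instruct-simpo-lr-5e-07-gamma-1.5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Qwen2-Instruct models use a chat template, applied via the tokenizer.
messages = [{"role": "user", "content": "Summarize SimPO in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```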
					
						
## Training and evaluation data

Training used the princeton-nlp/llama3-ultrafeedback preference dataset (pairs of chosen and rejected responses); the evaluation results above were computed on a held-out evaluation split. The snippet below shows how to inspect the data.
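A quick way to look at the dataset, assuming only the `datasets` library; the split and column names are printed rather than assumed.

```python
from datasets import load_dataset

# Download the preference data used for training.
ds = load_dataset("princeton-nlp/llama3-ultrafeedback")

print(ds)                        # available splits and their sizes
print(ds["train"].column_names)  # e.g. prompt / chosen / rejected fields
print(ds["train"][0])            # one raw preference example
```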
					
						
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a hedged sketch of a matching TRL-style run follows the list):
- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 128 (2 per device × 8 devices × 8 accumulation steps)
- total_eval_batch_size: 32 (4 per device × 8 devices)
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
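Given the `trl` tag, one way to reproduce a run with these settings is TRL's `CPOTrainer`, which exposes SimPO through `loss_type="simpo"`. This is a hedged sketch, not the project's actual training script (which may use a custom SimPO trainer); `beta` and the `bf16` flag are assumptions, and multi-GPU launch (e.g. via `accelerate`) is omitted.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = "Qwen/Qwen2-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Values mirror the hyperparameter list above; beta and bf16 are assumptions.
args = CPOConfig(
    output_dir="qwen2-1.5b-instruct-simpo-lr-5e-07-gamma-1.5",
    loss_type="simpo",              # SimPO variant of the CPO loss
    cpo_alpha=0.0,                  # drop the auxiliary NLL term -> pure SimPO
    simpo_gamma=1.5,                # target reward margin, from the model name
    beta=2.0,                       # assumed; not recorded in this card
    learning_rate=5e-7,
    per_device_train_batch_size=2,  # x 8 GPUs x 8 accumulation steps = 128
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,
)

dataset = load_dataset("princeton-nlp/llama3-ultrafeedback")

trainer = CPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,  # newer TRL versions name this processing_class
)
trainer.train()
```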
					
						
### Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 1.6402        | 0.8549 | 400  | 1.6353          | -2.6155        | -2.7990          | 0.5726             | 0.1835          | -1.1196        | -1.0462      | -1.5085         | -1.4841       |
### Framework versions

- Transformers 4.41.2
- Pytorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1