This model was trained using OpenRLHF.
Looking to dive deeper into LLMs? Explore learning resources, tutorials, and guides at AI Roadmap, your go-to platform for practical LLM training and deployment knowledge.
- Initialized from the SmolLM2-135M-Instruct model.
- Uses the preference_dataset_mixture2_and_safe_pku dataset, shuffled with `dataset.shuffle(seed=42)` (see the data-preparation sketch below). After shuffling:
  - 480k examples used for training
  - 4k examples used for evaluation
  - 40k examples used for final testing
- Trained on a single RTX 3090 for a total of 29 hours.
- Training command:
```bash
deepspeed --module openrlhf.cli.train_rm \
  --max_len 2048 \
  --dataset ./dataset/train \
  --eval_dataset ./dataset/eval \
  --chosen_key chosen \
  --rejected_key rejected \
  --apply_chat_template \
  --train_batch_size 8 \
  --micro_train_batch_size 8 \
  --pretrain HuggingFaceTB/SmolLM2-135M-Instruct \
  --save_path ./checkpoint/SmolLM2-135M-Reward \
  --save_steps 1000 \
  --logging_steps 1 \
  --eval_steps 1000 \
  --zero_stage 0 \
  --max_epochs 1 \
  --bf16 \
  --learning_rate 9e-6 \
  --use_wandb your_40_digit_wandb_token \
  --wandb_project OpenRLHF_rm_train \
  --wandb_run_name qwen3-0.6B-SFT \
  --gradient_checkpointing
```
- To run the model, please use:
```python
from transformers import AutoTokenizer
from openrlhf.models import get_llm_for_sequence_regression

pretrain = "AI-Roadmap/SmolLM2-135M-rm-60k"
model = get_llm_for_sequence_regression(
    model_name_or_path=pretrain,
    model_type="reward",
    use_flash_attention_2=False,
    bf16=True,
    init_value_head=False,
)
tokenizer = AutoTokenizer.from_pretrained(
    pretrain,
    trust_remote_code=True,
)
model.eval().to("cuda")
```
- Test accuracy: 0.764425, chosen reward: -0.032832, reject reward: -1.620852
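Continuing from the loading snippet above, here is a minimal scoring sketch. The example conversations and the `score` helper are illustrative only, and it assumes the OpenRLHF reward head returns one scalar reward per sequence; the reported test accuracy is the fraction of pairs where the chosen response scores above the rejected one.

```python
import torch

# Hypothetical prompt/response pair; any chat-formatted conversation works.
messages_chosen = [
    {"role": "user", "content": "How do I stay safe online?"},
    {"role": "assistant", "content": "Use strong, unique passwords and enable two-factor authentication."},
]
messages_rejected = [
    {"role": "user", "content": "How do I stay safe online?"},
    {"role": "assistant", "content": "Just reuse the same password everywhere; it is easier to remember."},
]

def score(messages):
    # Render the conversation with the same chat template used during training
    # (the training command passes --apply_chat_template).
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to("cuda")
    with torch.no_grad():
        # Assumed behavior: the reward model returns a (batch,) tensor of rewards.
        reward = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])
    return reward.item()

print("chosen:", score(messages_chosen), "rejected:", score(messages_rejected))
```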
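For reference, a minimal data-preparation sketch for the 480k/4k/40k split described above. The hub dataset id, the assumption that the source split has at least 524k rows, and the on-disk layout expected by `--dataset ./dataset/train` / `--eval_dataset ./dataset/eval` are all assumptions; adjust them to your setup.

```python
from datasets import load_dataset

# Assumed hub id for preference_dataset_mixture2_and_safe_pku; substitute yours if it differs.
ds = load_dataset("OpenRLHF/preference_dataset_mixture2_and_safe_pku", split="train")
ds = ds.shuffle(seed=42)

# 480k / 4k / 40k split; indices assume the dataset has at least 524k rows.
train_ds = ds.select(range(0, 480_000))
eval_ds = ds.select(range(480_000, 484_000))
test_ds = ds.select(range(484_000, 524_000))

# Assumed layout consumed by the training command shown above.
train_ds.save_to_disk("./dataset/train")
eval_ds.save_to_disk("./dataset/eval")
test_ds.save_to_disk("./dataset/test")
```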
Want to learn how to build and fine-tune models like this? Visit AI Roadmap for more learning materials and LLM insights to supercharge your AI journey!