RewardBench metrics don't match
#2 by Zeon-Zhuang - opened
I evaluated the model on RewardBench v1 and got the following per-category scores:
{'Chat': 0.9050279329608939, 'Chat Hard': 0.3793859649122807, 'Safety': 0.8527027027027027, 'Reasoning': 0.8846960167714885, 'overall': 0.7554531543368415, 'micro_average': 0.8020100502512563}
These do not match the reported results. Could you share the exact reproduction code or the inference outputs?
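For reference, the 'overall' value above looks like the unweighted mean of the four section scores; a quick check of that aggregation (assuming RewardBench v1's standard unweighted macro average over sections):

```python
# Quick check: 'overall' should be the unweighted mean of the four
# section scores listed above (RewardBench v1's macro average).
scores = {
    "Chat": 0.9050279329608939,
    "Chat Hard": 0.3793859649122807,
    "Safety": 0.8527027027027027,
    "Reasoning": 0.8846960167714885,
}
overall = sum(scores.values()) / len(scores)
print(overall)  # ~0.7554531543368415, matching the reported 'overall'
```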
After cloning the latest reward-bench repo and installing the required dependencies, run the following command:
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py \
--model=Skywork/Skywork-Reward-V2-Qwen3-4B \
--batch_size=8 \
--max_length=16384 \
--not_quantized
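If you just want a quick sanity check outside the reward-bench harness, a minimal scoring sketch along these lines should work, assuming the checkpoint exposes a standard single-logit sequence-classification head (the prompt/response pair below is purely illustrative):

```python
# Minimal sketch: score a chosen/rejected pair directly with transformers.
# Assumes a single-logit sequence-classification head; not the official
# evaluation pipeline, just a sanity check.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Skywork/Skywork-Reward-V2-Qwen3-4B"
device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    num_labels=1,
).to(device)
model.eval()

prompt = "What is the capital of France?"
chosen = [{"role": "user", "content": prompt},
          {"role": "assistant", "content": "The capital of France is Paris."}]
rejected = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": "The capital of France is Berlin."}]

with torch.no_grad():
    for conv in (chosen, rejected):
        input_ids = tokenizer.apply_chat_template(
            conv, tokenize=True, return_tensors="pt"
        ).to(device)
        score = model(input_ids).logits[0][0].item()
        print(score)  # the chosen response should receive the higher score
```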
Zeon-Zhuang changed discussion status to closed