RewardBench metrics mismatch

#2
by Zeon-Zhuang - opened

I evaluated the model on RewardBench v1 and got the following per-section scores:
{'Chat': 0.9050279329608939, 'Chat Hard': 0.3793859649122807, 'Safety': 0.8527027027027027, 'Reasoning': 0.8846960167714885, 'overall': 0.7554531543368415, 'micro_average': 0.8020100502512563}
These do not match the reported results. Could you share the exact reproduction code or the inference outputs?
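As a sanity check on the score dict above, `overall` is the unweighted mean of the four section accuracies, and `micro_average` is the pooled pair-level accuracy. A minimal sketch; the subset sizes (358, 456, 740, 1431) are back-derived from the reported accuracies, not quoted from the RewardBench repo:

```python
# Verify the aggregate fields in the reported score dict.
# Subset sizes are assumed RewardBench v1 counts, inferred from the
# per-section accuracies (e.g. 0.9050... = 324/358), not from the repo.
scores = {
    "Chat": 0.9050279329608939,
    "Chat Hard": 0.3793859649122807,
    "Safety": 0.8527027027027027,
    "Reasoning": 0.8846960167714885,
}
sizes = {"Chat": 358, "Chat Hard": 456, "Safety": 740, "Reasoning": 1431}

# overall: unweighted mean of the four section accuracies
overall = sum(scores.values()) / len(scores)

# micro_average: total correct pairs / total pairs across all subsets
micro = sum(scores[k] * sizes[k] for k in scores) / sum(sizes.values())

print(overall)  # ≈ 0.7554531543368415
print(micro)    # ≈ 0.8020100502512563
```

Both aggregates reproduce the reported values, so the discrepancy with the paper lies in the per-section scores themselves, not in the averaging.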

Skywork org
edited Jul 3

After cloning the latest version of the reward-bench repo and installing the corresponding dependencies, run the following command:

CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py \
        --model=Skywork/Skywork-Reward-V2-Qwen3-4B \
        --batch_size=8 \
        --max_length=16384 \
        --not_quantized
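For reference, the metric behind these numbers is simple: a preference pair counts as correct when the reward model scores the chosen response above the rejected one, and each subset reports the fraction of correct pairs. A minimal sketch with made-up scores (not real model outputs):

```python
# Sketch of RewardBench-style pairwise accuracy: a pair is "correct"
# when the chosen response's reward exceeds the rejected response's.
def pairwise_accuracy(pairs):
    """pairs: list of (chosen_score, rejected_score) tuples."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Illustrative scores only, not outputs of Skywork-Reward-V2-Qwen3-4B.
pairs = [(2.3, -1.1), (0.5, 0.9), (1.7, 0.2), (-0.4, -0.8)]
print(pairwise_accuracy(pairs))  # 3 of 4 pairs correct -> 0.75
```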
Zeon-Zhuang changed discussion status to closed
