RewardBench metrics mismatch

#2
by Zeon-Zhuang - opened

I evaluated the model on RewardBench v1 and got the following per-section scores:
{'Chat': 0.9050279329608939, 'Chat Hard': 0.3793859649122807, 'Safety': 0.8527027027027027, 'Reasoning': 0.8846960167714885, 'overall': 0.7554531543368415, 'micro_average': 0.8020100502512563}
These do not match the reported results. Could you share the exact reproduction code or the inference outputs?
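As a sanity check on the score dict above, `overall` is the unweighted mean of the four section accuracies, and `micro_average` is the pooled pair-level accuracy. A minimal sketch; the subset sizes (358, 456, 740, 1431) are back-derived from the reported accuracies, not quoted from the RewardBench repo:

```python
# Verify the aggregate fields in the reported score dict.
# Subset sizes are assumed RewardBench v1 counts, inferred from the
# per-section accuracies (e.g. 0.9050... = 324/358), not from the repo.
scores = {
    "Chat": 0.9050279329608939,
    "Chat Hard": 0.3793859649122807,
    "Safety": 0.8527027027027027,
    "Reasoning": 0.8846960167714885,
}
sizes = {"Chat": 358, "Chat Hard": 456, "Safety": 740, "Reasoning": 1431}

# overall: unweighted mean of the four section accuracies
overall = sum(scores.values()) / len(scores)

# micro_average: total correct pairs / total pairs across all subsets
micro = sum(scores[k] * sizes[k] for k in scores) / sum(sizes.values())

print(overall)  # ≈ 0.7554531543368415
print(micro)    # ≈ 0.8020100502512563
```

Both aggregates reproduce the reported values, so the discrepancy with the paper lies in the per-section scores themselves, not in the averaging.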

Skywork org
edited Jul 3

After cloning the latest version of the reward-bench repo and installing the corresponding dependencies, run the following command:

CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py \
        --model=Skywork/Skywork-Reward-V2-Qwen3-4B \
        --batch_size=8 \
        --max_length=16384 \
        --not_quantized
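For reference, the metric behind these numbers is simple: a preference pair counts as correct when the reward model scores the chosen response above the rejected one, and each subset reports the fraction of correct pairs. A minimal sketch with made-up scores (not real model outputs):

```python
# Sketch of RewardBench-style pairwise accuracy: a pair is "correct"
# when the chosen response's reward exceeds the rejected response's.
def pairwise_accuracy(pairs):
    """pairs: list of (chosen_score, rejected_score) tuples."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Illustrative scores only, not outputs of Skywork-Reward-V2-Qwen3-4B.
pairs = [(2.3, -1.1), (0.5, 0.9), (1.7, 0.2), (-0.4, -0.8)]
print(pairwise_accuracy(pairs))  # 3 of 4 pairs correct -> 0.75
```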
Zeon-Zhuang changed discussion status to closed
