
Robust Reward Model for LLM-as-a-Judge

This repository contains a robust, general-domain generative reward model presented in the paper One Token to Fool LLM-as-a-Judge.

Model Description

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. Despite the seeming simplicity of this comparison task, existing generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards.

We find that such weakness is widespread across various LLMs, datasets, and prompt formats, posing a serious threat to core algorithmic paradigms relying on generative reward models, such as rejection sampling, preference optimization, and RLVR.

To mitigate this issue, we train a robust, general-domain generative reward model by leveraging a simple yet effective data augmentation strategy. Our reward model demonstrates substantially improved robustness over the most advanced commercial models (e.g., GPT-4o, GPT-o1, Claude-4) and specialized generative verifiers (e.g., Omni-Judge, Generative-Verifier).

How to use

Given a question, its ground-truth reference answer, and a response to be evaluated, the model judges whether the response is correct. An example inference script is provided below.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the judge model and its tokenizer ("auto" keeps the checkpoint's bfloat16 weights).
tokenizer = AutoTokenizer.from_pretrained("sarosavo/Master-RM")
model = AutoModelForCausalLM.from_pretrained("sarosavo/Master-RM", torch_dtype="auto")

PROMPT= '''
Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.  
The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.  
**The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**  

Your task:  
- Compare the final output of the solution process with the reference answer.  
- If they **match exactly**, output **YES**.  
- If they **do not match**, output **NO**.  
- If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.  

Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.  

---

**Question:**  
{question}  

**Solution Process (Final Step Only):**  
{response}  

**Reference Answer:**  
{reference}  

**Output:**  
'''


question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (  )."
label="Chen Heqin"
answer="heqin chen"

prompt_question = PROMPT.format(question=question, reference=label, response=answer)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_question},
]

# Apply the chat template with the assistant turn opened, then decode only the newly generated tokens.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, do_sample=False, max_new_tokens=10)
judgement = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print("Model judgement:", judgement)

Use this reward model for RLVR training

1. Launch a remote reward server with vLLM

The script below launches a reward server at http://127.0.0.1:8000/get_reward; a minimal client-side sanity check is sketched after the argument descriptions.

bash reward_server/launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}

# MODEL_PATH:   path to our reward model
# ANSWER_PATH:  path to the training data
# METRIC:       reward metric, greedy or prob
# This launches a reward server at http://127.0.0.1:8000/get_reward

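Once the server is up, it can be queried over HTTP. The exact request and response schema is defined by the reward_server code in this repository; the JSON field name used below ("query") is an assumption for illustration only and should be adjusted to match the actual server.

import requests

# Minimal sanity check against the running reward server. The payload field name
# ("query") is an assumption; consult reward_server for the exact schema.
payload = {"query": [prompt_question]}  # the formatted judge prompt from the inference example
resp = requests.post("http://127.0.0.1:8000/get_reward", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # expected to contain the reward score(s)
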
2. Start RLVR training

bash reward_server/RLVR_train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}

# METHOD:          advantage estimator, e.g., reinforce_baseline, reinforce, rloo
# PRETRAIN_PATH:   path to the pretrained model, e.g., Qwen2.5-7B
# DATA_PATH:       path to the QA data used for RLVR training
# REWARD_API:      remote reward server url, e.g., http://127.0.0.1:8000/get_reward

Citation

If you use this model, please cite:

@article{zhao2025one,
  title={One Token to Fool LLM-as-a-Judge},
  author={Zhao, Yulai and Liu, Haolin and Yu, Dian and Kung, S.Y. and Mi, Haitao and Yu, Dong},
  journal={arXiv preprint arXiv:2507.08794},
  year={2025}
}

Acknowledgements

This model is built upon Qwen2.5-7B-Instruct-RLVR.
