Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Abstract
Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions, Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics, but often does so by reinforcing superficial heuristics rather than by acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code is available at https://github.com/xashru/rlvr-seq-generalization.
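To make the "verifiable rewards" setting concrete, below is a minimal sketch of what a verifier for the Longest Increasing Subsequence task could look like. This is illustrative only and not taken from the paper's code: the function names (`lis_length`, `verify_lis_answer`) and the exact reward rule are assumptions. Because the paper's datasets have unique optima, checking validity plus optimal length would suffice to assign a binary reward.

```python
from bisect import bisect_left


def lis_length(seq: list[int]) -> int:
    """Length of the longest strictly increasing subsequence
    (patience sorting, O(n log n))."""
    # tails[k] = smallest possible tail of an increasing subsequence of length k+1
    tails: list[int] = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def verify_lis_answer(seq: list[int], candidate: list[int]) -> bool:
    """Hypothetical binary reward: True iff `candidate` is a strictly
    increasing subsequence of `seq` whose length equals the true optimum."""
    # Check that candidate elements appear in `seq` in order (subsequence test).
    it = iter(seq)
    if not all(any(x == y for y in it) for x in candidate):
        return False
    # Check strict monotonic increase.
    if any(a >= b for a, b in zip(candidate, candidate[1:])):
        return False
    # With unique optima, matching the optimal length verifies the answer.
    return len(candidate) == lis_length(seq)


# Example: an optimal answer earns the reward, a shorter valid one does not.
seq = [3, 1, 4, 1, 5, 9, 2, 6]
assert verify_lis_answer(seq, [1, 4, 5, 9])   # optimal (length 4)
assert not verify_lis_answer(seq, [1, 4, 5])  # valid subsequence, not optimal
```

A verifier of this shape makes the reward fully programmatic, which is exactly what allows RLVR to optimize the metric without guaranteeing that the underlying reasoning improves.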