CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
Abstract
CityRiSE, a reinforcement learning framework, enhances Large Vision-Language Models for accurate and interpretable urban socio-economic status prediction using multi-modal data.
Urban socio-economic sensing, which harnesses publicly available, large-scale web data such as street view and satellite imagery, is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle to produce accurate and interpretable socio-economic predictions from visual data. To address these limitations and realize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and a verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE, with its emergent reasoning process, significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
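To make the "verifiable reward design" mentioned in the abstract concrete, below is a minimal sketch of how such a reward might be scored for socio-economic status prediction. The tag format (`<think>`/`<answer>`), the ordinal partial-credit rule, the number of levels, and the weighting are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a verifiable reward for RL fine-tuning of an LVLM
# on socio-economic status prediction. All names and weights are assumptions.
import re

def format_reward(completion: str) -> float:
    """1.0 if the model emits a reasoning trace followed by a structured answer."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, target_level: int, num_levels: int = 5) -> float:
    """Verifiable accuracy term: compare the predicted level with the ground-truth
    indicator bucket; partial credit decays with ordinal distance."""
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", completion)
    if match is None:
        return 0.0
    predicted = int(match.group(1))
    distance = abs(predicted - target_level)
    return max(0.0, 1.0 - distance / (num_levels - 1))

def total_reward(completion: str, target_level: int) -> float:
    # Weighted sum of format and accuracy terms; the 0.2/0.8 split is a placeholder.
    return 0.2 * format_reward(completion) + 0.8 * accuracy_reward(completion, target_level)

# Example: a well-formatted completion predicting level 4 when the ground truth is 5.
sample = "<think>Dense high-rise housing, paved roads, commercial signage.</think> <answer>4</answer>"
print(total_reward(sample, target_level=5))  # 0.2 * 1.0 + 0.8 * 0.75 = 0.8
```

Because both terms can be checked programmatically against ground truth, a reward of this shape can drive pure RL (e.g., policy-gradient updates over sampled completions) without a learned reward model.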
Community
CityRiSE is a novel framework that guides Large Vision-Language Models to achieve accurate and interpretable urban socio-economic predictions, setting a new standard for generalization across unseen cities and indicators.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Urban-R1: Reinforced MLLMs Mitigate Geospatial Biases for Urban General Intelligence (2025)
- Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs (2025)
- Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning (2025)
- MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning (2025)
- GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions (2025)
- Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning (2025)
- GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning (2025)