arxiv:2507.15493

GR-3 Technical Report

Published on Jul 21

· Submitted by

CH3COOK on Jul 22

Upvote

Authors:

Sijin Chen ,

Yingdong Hu ,

Liqun Huang ,

Xiao Ma ,

Wenxuan Ou ,

Wanli Peng ,

Xin Xiao ,

Yuyang Xiao ,

Jiafeng Xu ,

Abstract

A large-scale vision-language-action model demonstrates exceptional generalization, fine-tuning efficiency, and robust performance in complex robotic tasks, outperforming existing baselines.

AI-generated summary

We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, pi_0, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.

View arXiv page View PDF Project page Add to collection