Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Yanjiang Guo*, Lucy Xiaoyang Shi*, Jianyu Chen, Chelsea Finn
*Equal contribution; Stanford University, Tsinghua University
TL;DR:
Ctrl-World is an action-conditioned world model compatible with modern vision-language-action (VLA) policies. It enables policy-in-the-loop rollouts entirely in imagination, which can be used to evaluate and improve the instruction-following ability of VLA policies.
Model Details:
This repo includes the Ctrl-World model checkpoint trained on the open-source DROID dataset (~95k trajectories, 564 scenes). The DROID platform consists of a Franka Panda robotic arm equipped with a Robotiq gripper and three cameras: two randomly placed third-person cameras and one wrist-mounted camera.
Usage
See the official Ctrl-World GitHub repo for detailed usage instructions.
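To make the "policy-in-the-loop rollout in imagination" idea concrete, here is a minimal sketch of the loop structure: the policy proposes an action from the current observation, and the world model predicts the next observation, so an entire trajectory unfolds without touching a real robot. The `WorldModel` and `Policy` classes below are illustrative stand-ins with hypothetical signatures, not the actual Ctrl-World or openpi interfaces (see the official repo for those).

```python
import numpy as np

# Hypothetical stand-ins; the real Ctrl-World is an action-conditioned
# video diffusion model and the real policy is a VLA model from openpi.
class WorldModel:
    """Predicts the next observation given the current one and an action."""
    def step(self, obs: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Placeholder dynamics for illustration only.
        return obs + 0.1 * action.mean()

class Policy:
    """Maps an observation and a language instruction to an action."""
    def act(self, obs: np.ndarray, instruction: str) -> np.ndarray:
        return np.zeros(7)  # placeholder 7-DoF arm action

def imagined_rollout(world_model, policy, obs, instruction, horizon=16):
    """Policy-in-the-loop rollout entirely inside the world model."""
    trajectory = [obs]
    for _ in range(horizon):
        action = policy.act(obs, instruction)      # policy proposes action
        obs = world_model.step(obs, action)        # model imagines next frame
        trajectory.append(obs)
    return trajectory

traj = imagined_rollout(WorldModel(), Policy(),
                        obs=np.zeros((64, 64, 3)),
                        instruction="pick up the cup")
print(len(traj))  # horizon + 1 observations
```

Scoring such imagined trajectories (e.g., checking whether the instruction was followed) is what allows evaluating and improving a VLA policy without real-world rollouts.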
Acknowledgement
Ctrl-World is built on the open-source video foundation model Stable Video Diffusion. The VLA model used in this repo is from openpi. We thank the authors for their efforts!
BibTeX
If you find our work helpful, please leave us a star and cite our paper. Thank you!
@article{guo2025ctrl,
  title={Ctrl-World: A Controllable Generative World Model for Robot Manipulation},
  author={Guo, Yanjiang and Shi, Lucy Xiaoyang and Chen, Jianyu and Finn, Chelsea},
  journal={arXiv preprint arXiv:2510.10125},
  year={2025}
}