JarvisArt

Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Yunlong Lin* Zixu Lin* Kunjie Lin* Jinbin Bai Panwang Pan Chenxin Li Haoyu Chen Zhongdao Wang Xinghao Ding† Wenbo Li♣ Shuicheng Yan†
Xiamen University HKUST(GZ) CUHK Bytedance NUS Tsinghua University
Project Page

Abstract

We introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics professional artists' reasoning, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process and demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on our MMArt-Bench benchmark while maintaining comparable instruction-following capabilities.

Professional Reasoning

Mimics the reasoning process of professional artists through Chain-of-Thought supervised fine-tuning and GRPO-R optimization.

Comprehensive Toolset

Intelligently coordinates over 200 retouching tools within Lightroom for both global and local adjustments.

User-Friendly Interaction

Supports intuitive, free-form edits through natural inputs like text prompts, bounding boxes, or brushstrokes.

JarvisArt Interface
Figure 1: JarvisArt supports multi-granularity retouching through natural inputs and edits any-resolution images.
200+
Retouching Tools
60%
Improvement over GPT-4o
55K
Training Samples
Image Resolution Support