JarvisArt: Tech Poster

Abstract

We introduce JarvisArt, an agent driven by a multi-modal large language model (MLLM) that understands user intent, mimics the reasoning of professional artists, and coordinates over 200 retouching tools within Lightroom. JarvisArt is trained in two stages, Chain-of-Thought supervised fine-tuning followed by GRPO-R optimization, and demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench while maintaining comparable instruction-following capabilities.

Professional Reasoning

Mimics the reasoning process of professional artists through Chain-of-Thought supervised fine-tuning followed by GRPO-R (Group Relative Policy Optimization for Retouching).
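As a rough illustration of the "group relative" idea behind GRPO-style training, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. This is the generic GRPO advantage computation only; the retouching-specific rewards used by GRPO-R are not described here, so the reward values and the function name `group_relative_advantages` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Uses the population standard deviation; some implementations
    use the sample estimate instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate retouching plans sampled for one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and are reinforced, while those below the mean are discouraged, so no separate value model is needed.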

Comprehensive Toolset

Intelligently coordinates over 200 retouching tools within Lightroom for both global and local adjustments.

User-Friendly Interaction

Supports intuitive, free-form edits through natural inputs like text prompts, bounding boxes, or brushstrokes.
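One way to picture these input types is as a single request object that always carries a text prompt and optionally carries a region hint. The sketch below is purely hypothetical: `RetouchRequest`, its fields, and `scope` are illustrative names, not JarvisArt's actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RetouchRequest:
    """Hypothetical container for one free-form edit instruction."""
    prompt: str
    # Optional box as (x_min, y_min, x_max, y_max) in pixels.
    bbox: Optional[Tuple[int, int, int, int]] = None
    # Optional brushstroke as a list of (x, y) pixel coordinates.
    brush_points: List[Tuple[int, int]] = field(default_factory=list)

    def scope(self) -> str:
        """Return "local" when a region hint is given, else "global"."""
        has_region = self.bbox is not None or bool(self.brush_points)
        return "local" if has_region else "global"


global_edit = RetouchRequest(prompt="Warm up the overall tone")
local_edit = RetouchRequest(prompt="Brighten the face", bbox=(120, 80, 360, 400))
```

Here a prompt alone is treated as a global adjustment, and adding a box or brushstroke marks the request as a local one.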

Figure 1: JarvisArt supports multi-granularity retouching through natural inputs and edits any-resolution images.

200+ Retouching Tools

60% Improvement over GPT-4o (average pixel-level metrics, MMArt-Bench)

55K Training Samples

Any-Resolution Image Support