Introduction

This repository contains UI-R1-E-3B, an efficient GUI grounding model introduced in the paper UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning.

Project page: https://github.com/lll6gg/UI-R1

Previous version: UI-R1-3B

Benchmark 1: ScreenSpotV2

| ScreenSpotV2 | Inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
|---|---|---|---|---|---|---|---|---|
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
| UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |

Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | Inference mode | Average Length↓ | Average Accuracy↑ |
|---|---|---|---|
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| UI-R1-E-3B | w/o thinking | 28 | 33.5 |

Leaderboard: UI-I2E-Bench

| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29 | 55.8 |
| OmniParser-V2 | 72 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |

Evaluation Code for GUI Grounding

  1. Generation for UI-R1-E-3B:

    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    # args.model_path, rank, ori_processor_path, task_prompt, and image_path
    # come from the surrounding evaluation script.
    # Load on CPU first, then move the model to the target device (rank).
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        args.model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="cpu",
    )
    model = model.to(torch.device(rank))
    model = model.eval()
    processor = AutoProcessor.from_pretrained(ori_processor_path)

    # Grounding prompt: ask for a click action plus an integer coordinate,
    # wrapped in <answer> </answer> tags (kept verbatim from the eval setup).
    question_template = (
        f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
        "Please provide the action to perform (enumerate in ['click'])"
        "and the coordinate where the cursor is moved to(integer) if click is performed.\n"
        "Output the final answer in <answer> </answer> tags directly."
        "The output answer format should be as follows:\n"
        "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
        "Please strictly follow the format."
    )
    query = '<image>\n' + question_template
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": query},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)  # inputs must live on the same device as the model
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    pred_coord, _ = extract_coord(response)  # parse the predicted [x, y] from the answer
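Here, extract_coord is a helper from the UI-R1 repository that parses the predicted point out of the response. A minimal sketch of such a parser, assuming the <answer>[{'action': 'click', 'coordinate': [x, y]}]</answer> format shown in the prompt above (the official helper may differ):

    import re

    def extract_coord(response: str):
        """Hypothetical re-implementation for illustration only.

        Pulls the first [x, y] integer pair out of the <answer> block and
        returns (coordinate, success_flag).
        """
        match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        content = match.group(1) if match else response
        numbers = re.findall(r"-?\d+", content)
        if len(numbers) >= 2:
            return [int(numbers[0]), int(numbers[1])], True
        return [0, 0], False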
    
  2. Rescale the predicted coordinate back to the original image resolution:

    from PIL import Image

    # The model predicts coordinates on the smart-resized image, so they must
    # be mapped back to the original resolution before scoring.
    image = Image.open(image_path)
    origin_width, origin_height = image.size
    resized_height, resized_width = smart_resize(
        origin_height, origin_width, max_pixels=12845056
    )
    scale_x = origin_width / resized_width
    scale_y = origin_height / resized_height
    pred_coord[0] = int(pred_coord[0] * scale_x)
    pred_coord[1] = int(pred_coord[1] * scale_y)
    

    The smart_resize function is from Qwen2-VL:

    import math
    def smart_resize(
        height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
    ):
        """Rescales the image so that the following conditions are met:
    
        1. Both dimensions (height and width) are divisible by 'factor'.
    
        2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
    
        3. The aspect ratio of the image is maintained as closely as possible.
    
        """
        if height < factor or width < factor:
            raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
        elif max(height, width) / min(height, width) > 200:
            raise ValueError(
                f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
            )
        h_bar = round(height / factor) * factor
        w_bar = round(width / factor) * factor
        if h_bar * w_bar > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h_bar = math.floor(height / beta / factor) * factor
            w_bar = math.floor(width / beta / factor) * factor
        elif h_bar * w_bar < min_pixels:
            beta = math.sqrt(min_pixels / (height * width))
            h_bar = math.ceil(height * beta / factor) * factor
            w_bar = math.ceil(width * beta / factor) * factor
        return h_bar, w_bar
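
For concreteness, a quick end-to-end check of the rescaling logic (the 1920x1080 resolution and the center-point prediction are illustrative values, not from the paper):

    # A 1920x1080 screenshot stays under max_pixels, so smart_resize only
    # snaps both sides to multiples of 28: (1080, 1920) -> (1092, 1932).
    resized_height, resized_width = smart_resize(1080, 1920, max_pixels=12845056)
    assert (resized_height, resized_width) == (1092, 1932)

    # A prediction at the center of the resized image maps back to the
    # center of the original screenshot.
    pred_coord = [966, 546]  # (x, y) predicted in resized coordinates
    scale_x, scale_y = 1920 / 1932, 1080 / 1092
    original = [int(pred_coord[0] * scale_x), int(pred_coord[1] * scale_y)]
    assert original == [960, 540]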
    