# Gemma-3n-E2B-it Android Control LoRA Fine-tuned Model
## Model Overview

This model is a fine-tuned version of Google's `gemma-3n-E2B-it` base model, adapted with LoRA for Android UI control tasks.
## Training Data

- Dataset: `OfficerChul/Android-Control-84k`
- Data Format: mobile UI screenshots paired with user instructions, labeled with the appropriate action to perform (click, scroll, input, etc.)

### Training Data Format Example
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```
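Since the assistant's turn is a JSON string embedded inside the chat message, downstream code has to parse it back into a structured action. A minimal sketch using only the standard library (the field names come from the example above; the helper name is illustrative):

```python
import json

def parse_action(assistant_content: str) -> dict:
    """Parse the assistant's JSON action string into a dict.

    Raises json.JSONDecodeError on malformed output; such failures
    are what the "Malformed JSON" column in the evaluation table counts.
    """
    action = json.loads(assistant_content)
    if "action_type" not in action:
        raise ValueError("missing 'action_type' field")
    return action

# Round-trip the assistant content from the training record above:
print(parse_action('{"action_type": "click", "x": 561, "y": 535}'))
```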
## Training Method

LoRA fine-tuning was performed with the LLaMA-Factory framework.

### 1. Training Configuration (`gemma3n-e2b-it.yaml`)

- Base Model: `google/gemma-3n-E2B-it`
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank: 32
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Training Parameters:
  - Batch size: 4 (gradient accumulation steps: 48)
  - Learning rate: 2e-5
  - Epochs: 5
  - LR scheduler: cosine
  - Optimizer: AdamW (fused)
  - Precision: bf16
- Additional Settings:
  - Gradient checkpointing enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-2
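The settings above can be expressed as a LLaMA-Factory-style YAML sketch. This is a reconstruction, not the actual `gemma3n-e2b-it.yaml`: key names follow LLaMA-Factory conventions, and any value not listed above (e.g. the DeepSpeed config path) is an assumption.

```yaml
### model (sketch; not the original config file)
model_name_or_path: google/gemma-3n-E2B-it
trust_remote_code: true

### method
stage: sft
finetuning_type: lora
lora_rank: 32
lora_target: q_proj,k_proj,v_proj,o_proj

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 48
learning_rate: 2.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
optim: adamw_torch_fused
bf16: true
gradient_checkpointing: true
deepspeed: examples/deepspeed/ds_z2_config.json  # assumed path
```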
### 2. Model Merging (`gemma3n-e2b-it_lora_sft_merge.yaml`)

The trained LoRA adapter was merged into the base model:

- Base Model: `google/gemma-3n-E2B-it`
## Supported Action Types

- `click`: Click at specific coordinates
- `long_press`: Long press at specific coordinates
- `scroll`: Scroll (up/down)
- `input_text`: Enter text
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to the home screen
- `open_app`: Open an application
- `wait`: Wait
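A practical consumer of this model needs to check that a generated action is one of the supported types and carries the fields that type needs. The sketch below is illustrative: the per-type required fields are assumptions inferred from the training example, not a published schema.

```python
import json

# Assumed required fields per action type, beyond "action_type".
REQUIRED_FIELDS = {
    "click": {"x", "y"},
    "long_press": {"x", "y"},
    "scroll": {"direction"},
    "input_text": {"text"},
    "navigate_back": set(),
    "navigate_home": set(),
    "open_app": {"app_name"},
    "wait": set(),
}

def is_valid_action(raw: str) -> bool:
    """Return True if `raw` is well-formed JSON for a supported action."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return False
    kind = action.get("action_type")
    if kind not in REQUIRED_FIELDS:
        return False
    # Every required field for this action type must be present.
    return REQUIRED_FIELDS[kind] <= action.keys()

print(is_valid_action('{"action_type": "click", "x": 561, "y": 535}'))  # → True
```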
## Usage

The merged model can be loaded directly with the Hugging Face Transformers library. Because Gemma 3n is a multimodal model, load the matching processor (`AutoProcessor.from_pretrained`) as well when preparing image+text inputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "OfficerChul/gemma-3n-E2B-it-Android-Control-84k"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
)
```
## Evaluation Results
| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match | Avg. Episode Accuracy | Malformed JSON | Execution Time (s) | Inference Time (s) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) | 0.9063 | 110 (5.5%) | 1101.43 | 485.12 |
| Qwen/Qwen2.5-VL-7B-Instruct | 0.6125 | 59.89 (n=544) | 0.8197 (n=61) | 0.3243 (n=111) | 0.6163 | 499 (24.9%) | 720.88 | 580.92 |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) | 0.6615 | 440 (22.0%) | 676.76 | 536.27 |
| OfficerChul/Qwen2.5-VL-7B-Instruct-Android-Control-5a | 0.9970 | 427.30 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9974 | 0 (0.0%) | 1086.97 | 581.82 |
| OfficerChul/Qwen2.5-VL-3B-Instruct-Android-Control-5a | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) | 0.9976 | 1 (0.1%) | 672.88 | 530.95 |
| OfficerChul/InfiGUI-G1-7B-Android-Control-5a | 0.9970 | 466.24 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9968 | 1 (0.1%) | 897.58 | 552.23 |
| OfficerChul/InfiGUI-G1-3B-Android-Control-5a | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) | 0.9983 | 0 (0.0%) | 722.63 | 529.57 |
| InfiX-ai/InfiGUI-G1-7B | 0.6715 | 82.21 (n=821) | 0.8000 (n=70) | 0.2268 (n=194) | 0.6763 | 457 (22.9%) | 698.77 | 557.50 |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) | 0.8910 | 78 (3.9%) | 702.93 | 559.65 |
| OfficerChul/gemma-3n-E2B-it-Android-Control-84k | 0.5819 | 985.82 (n=123) | 0.8596 (n=114) | 0.2159 (n=88) | 0.5781 | 0 (0.0%) | 322.95 | 159.23 |
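The Click L2 Distance column reports Euclidean distance in pixels between predicted and reference click coordinates. As a sketch of how such a metric is computed (function names and the simple averaging are assumptions, not the actual evaluation script):

```python
import math

def click_l2(pred: tuple, gold: tuple) -> float:
    """Euclidean (L2) distance between a predicted and reference click."""
    return math.hypot(pred[0] - gold[0], pred[1] - gold[1])

def mean_click_l2(pairs: list) -> float:
    """Average L2 distance over all evaluated (pred, gold) click pairs."""
    return sum(click_l2(p, g) for p, g in pairs) / len(pairs)

# A prediction 30 px right and 40 px below the reference is 50 px off:
print(click_l2((591, 575), (561, 535)))  # → 50.0
```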
## License

This model follows the license terms of the Google Gemma models.
## Notes

- This model was developed for research on mobile UI automation and accessibility enhancement.
- Validate outputs carefully before using this model in production environments.