Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)

Gemma-Le is a compact Vision-Language-Action policy for robotic manipulation built on top of LeRobot. It replaces NV Eagle with standard Hugging Face components:

SigLIP (google/siglip-so400m-patch14-384) for vision
Gemma 3 (google/gemma-3-4b-it) for language/reasoning, fine-tuned with LoRA via PEFT
ScaleDP (Scalable Diffusion Transformer) as the action head

This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., robot_sim.PickNPlace).

Architecture

Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
Text: Gemma 3 4B-IT, mean-pooled hidden states
LoRA: rank=16 on [q_proj, k_proj, v_proj, o_proj]
Fusion: MLP projects [vision || text] -> conditioning_dim=768
Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
Temporal context: chunk_size=8; diffusion steps: num_diffusion_steps=50
Mixed precision: AMP auto-selects bf16/fp16; bf16 uses no GradScaler
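
The mixed-precision rule above can be sketched as a small helper. This is only an illustration, not the repository's code: select_amp and AmpSettings are hypothetical names, and the caller is assumed to know whether the GPU supports bf16 (in PyTorch, for instance, torch.cuda.is_bf16_supported() reports this).

```python
from typing import NamedTuple


class AmpSettings(NamedTuple):
    dtype: str             # autocast dtype name: "bfloat16" or "float16"
    use_grad_scaler: bool  # whether loss scaling is enabled


def select_amp(bf16_supported: bool) -> AmpSettings:
    """Hypothetical helper: prefer bf16 when available, else fall back to fp16."""
    if bf16_supported:
        # bf16 keeps the fp32 exponent range, so no GradScaler is used.
        return AmpSettings("bfloat16", False)
    # fp16 has a narrow exponent range; loss scaling guards against underflow.
    return AmpSettings("float16", True)


print(select_amp(True))
print(select_amp(False))
```

In a training loop the returned dtype would typically be passed to the autocast context, with a GradScaler created only when use_grad_scaler is true.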

Default config (excerpt)

vision_model_id: google/siglip-so400m-patch14-384
text_model_id:   google/gemma-3-4b-it
image_features:  ["observation.images.ego_view"]
action_feature:  "action"
chunk_size: 8
num_diffusion_steps: 50
conditioning_dim: 768
plan_update_interval: 10
scaledp_num_layers: 12
scaledp_dim_model: 320
scaledp_num_heads: 8
scaledp_dim_feedforward: 1280
use_lora: true
lora_rank: 16
lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
optimizer_lr: 1e-4
optimizer_weight_decay: 1e-6
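
As a rough illustration, the excerpt above could be mirrored in a dataclass with a couple of sanity checks. GemmaLeConfigSketch is a hypothetical name and is not the repository's actual config class; the validation rules are assumptions based on how multi-head attention and LoRA are normally parameterized.

```python
from dataclasses import dataclass, field


@dataclass
class GemmaLeConfigSketch:
    """Hypothetical mirror of the config excerpt (not the repo's config class)."""
    vision_model_id: str = "google/siglip-so400m-patch14-384"
    text_model_id: str = "google/gemma-3-4b-it"
    image_features: list = field(
        default_factory=lambda: ["observation.images.ego_view"])
    action_feature: str = "action"
    chunk_size: int = 8
    num_diffusion_steps: int = 50
    conditioning_dim: int = 768
    plan_update_interval: int = 10
    scaledp_num_layers: int = 12
    scaledp_dim_model: int = 320
    scaledp_num_heads: int = 8
    scaledp_dim_feedforward: int = 1280
    use_lora: bool = True
    lora_rank: int = 16
    lora_target_modules: list = field(
        default_factory=lambda: ["q_proj", "k_proj", "v_proj", "o_proj"])
    optimizer_lr: float = 1e-4
    optimizer_weight_decay: float = 1e-6

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across heads.
        if self.scaledp_dim_model % self.scaledp_num_heads != 0:
            raise ValueError(
                "scaledp_dim_model must be divisible by scaledp_num_heads")
        if self.use_lora and self.lora_rank <= 0:
            raise ValueError("lora_rank must be positive when use_lora is true")

    @property
    def head_dim(self) -> int:
        """Per-head width of the ScaleDP attention layers."""
        return self.scaledp_dim_model // self.scaledp_num_heads


cfg = GemmaLeConfigSketch()
print(cfg.head_dim)  # 320 // 8 = 40
```

With the defaults, each ScaleDP attention head is 40 wide and the feed-forward layer is 4x the model width (1280 = 4 * 320).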

Usage (with this repo’s LeRobot fork)

Install the dependencies and add this repository's lerobot directory to PYTHONPATH.

Evaluation-style load:

import torch
from huggingface_hub import snapshot_download
from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy

# Download the exported checkpoint from the Hub (cached locally after the first call).
ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")
# Load the policy in bf16 and switch to inference mode.
policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
policy.eval()

Training entrypoint:

python lerobot/lerobot/scripts/train.py \
  --policy.type gemma_le \
  --dataset.repo_id local/robot_sim.PickNPlace \
  --dataset.root /path/to/robot_sim.PickNPlace \
  --dataset.episodes "[0,1,2,3,4]" \
  --batch_size 3 \
  --steps 200000 \
  --log_freq 100 \
  --save_freq 5000 \
  --policy.vision_model_id google/siglip-so400m-patch14-384 \
  --policy.text_model_id google/gemma-3-4b-it \
  --policy.use_amp true \
  --progress_bar true \
  --push_to_hub true \
  --push_repo_id Ryukijano/gemma-groot \
  --push_branch main \
  --push_exist_ok true

Slurm (3× L40)

See submit_job.sh. Keep the Hugging Face caches on scratch storage and set the following environment variables:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
HF_HOME, HUGGINGFACE_HUB_CACHE, and TRANSFORMERS_CACHE, each pointing to a directory on scratch
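
For runs launched from Python rather than from the shell, the same environment could be set up as in the sketch below. The function name configure_scratch_caches and the hf_home/hub/transformers directory layout are assumptions for illustration; submit_job.sh remains the reference for the actual job setup.

```python
import os
from pathlib import Path


def configure_scratch_caches(scratch_root: str) -> dict:
    """Point the Hugging Face caches at scratch and set the CUDA allocator option.

    Hypothetical helper; the sub-directory layout is an assumption.
    """
    hf_home = Path(scratch_root) / "hf_home"
    env = {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "HF_HOME": str(hf_home),
        "HUGGINGFACE_HUB_CACHE": str(hf_home / "hub"),
        "TRANSFORMERS_CACHE": str(hf_home / "transformers"),
    }
    # Update the current process environment; child processes inherit it.
    os.environ.update(env)
    return env
```

Call this before importing torch or transformers, since those libraries are expected to read these variables during import or on first use.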

Checkpoints

The latest runs are uploaded under runs/<date>/<run>/<step> in this repo.
Example: runs/2025-08-12/13-06-07_gemma_le/020000/.
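
A small parser for this layout might look like the following sketch. RunCheckpoint and parse_checkpoint_path are hypothetical names, and the ISO date format and integer step are inferred from the single example path above.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RunCheckpoint:
    run_date: date
    run_name: str
    step: int


def parse_checkpoint_path(path: str) -> RunCheckpoint:
    """Split runs/<date>/<run>/<step> into its parts (hypothetical helper)."""
    parts = PurePosixPath(path).parts  # a trailing slash is ignored
    if len(parts) != 4 or parts[0] != "runs":
        raise ValueError(f"expected runs/<date>/<run>/<step>, got {path!r}")
    _, run_date, run_name, step = parts
    # Zero-padded step directories such as "020000" parse to plain integers.
    return RunCheckpoint(date.fromisoformat(run_date), run_name, int(step))


ckpt = parse_checkpoint_path("runs/2025-08-12/13-06-07_gemma_le/020000/")
print(ckpt.run_date, ckpt.run_name, ckpt.step)
```

The same path prefix can be passed to snapshot_download through its allow_patterns argument to fetch only one checkpoint instead of the whole repository.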

Data

Training data uses the LeRobotDataset format (parquet + mp4 + metadata), with a single RGB view (observation.images.ego_view) and action as the target.
Timestamp tolerance is auto-relaxed to max(tolerance_s, 1/fps + 1e-4) during training for robust decoding.
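
The relaxation rule can be written out directly. The function name effective_tolerance_s is hypothetical; only the formula comes from the text above.

```python
def effective_tolerance_s(tolerance_s: float, fps: float) -> float:
    """Relax the timestamp tolerance to at least one frame period plus 1e-4 s."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return max(tolerance_s, 1.0 / fps + 1e-4)


# A tight tolerance is widened to just over one frame period at 30 fps.
print(effective_tolerance_s(1e-4, fps=30))
# A tolerance that is already looser than one frame period is left unchanged.
print(effective_tolerance_s(0.5, fps=30))
```

At 30 fps the floor is roughly 0.0334 s, so any requested tolerance below that is raised to one frame period plus the 1e-4 s margin.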

Notes

Base model access: google/gemma-3-4b-it may require accepting Google's Gemma terms of use on Hugging Face before it can be downloaded.
Intended for imitation learning; ThinkAct-style planning can be layered on top.

Citations

LeRobot: https://github.com/huggingface/lerobot
Gemma 3: https://ai.google.dev/gemma
SigLIP: https://huggingface.co/timm/ViT-SigLIP
Diffusion Policy: https://arxiv.org/abs/2303.04137


Safetensors: model size 5B params; tensor types F32, BF16
