---
license: apache-2.0
language:
- en
tags:
- robotics
- vla
- lerobot
- imitation-learning
- diffusion-policy
- gemma-3
- siglip
- scaledp
- multimodal
---
# Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)
Gemma-Le is a compact Vision-Language-Action policy for robotic manipulation built on top of LeRobot.
It replaces NV Eagle with standard Hugging Face components:
- SigLIP `google/siglip-so400m-patch14-384` for vision
- Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
- ScaleDP (Scalable Diffusion Transformer) as the action head
This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).
## Architecture
- Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
- Text: Gemma 3 4B-IT, mean-pooled hidden states
- LoRA: rank=16 on `[q_proj, k_proj, v_proj, o_proj]`
- Fusion: an MLP projects the concatenated [vision || text] embedding to `conditioning_dim=768` (see the sketch after this list)
- Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
- Temporal context: `chunk_size=8`; diffusion steps `num_diffusion_steps=50`
- Mixed precision: AMP auto-selects bf16 or fp16; bf16 runs without a GradScaler
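To make the fusion step concrete, here is a minimal sketch of the conditioning path. The class and attribute names are illustrative, not the repo's actual modules, and the hidden sizes are the published defaults for SigLIP-so400m (1152) and Gemma 3 4B (2560):
```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative [vision || text] -> conditioning projection."""

    def __init__(self, vision_dim=1152, text_dim=2560, conditioning_dim=768):
        super().__init__()
        self.fusion_mlp = nn.Sequential(
            nn.Linear(vision_dim + text_dim, conditioning_dim),
            nn.GELU(),
            nn.Linear(conditioning_dim, conditioning_dim),
        )

    def forward(self, vision_pooled, text_hidden, text_mask):
        # Mean-pool Gemma hidden states over non-padding tokens.
        mask = text_mask.unsqueeze(-1).float()
        text_pooled = (text_hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        # Concatenate pooled vision and text, project to conditioning_dim.
        return self.fusion_mlp(torch.cat([vision_pooled, text_pooled], dim=-1))
```
The resulting conditioning vector is what the ScaleDP head consumes at each denoising step.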
## Default config (excerpt)
```yaml
vision_model_id: google/siglip-so400m-patch14-384
text_model_id: google/gemma-3-4b-it
image_features: ["observation.images.ego_view"]
action_feature: "action"
chunk_size: 8
num_diffusion_steps: 50
conditioning_dim: 768
plan_update_interval: 10
scaledp_num_layers: 12
scaledp_dim_model: 320
scaledp_num_heads: 8
scaledp_dim_feedforward: 1280
use_lora: true
lora_rank: 16
lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
optimizer_lr: 1e-4
optimizer_weight_decay: 1e-6
```
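The `use_lora`/`lora_rank`/`lora_target_modules` fields correspond to a standard PEFT adapter setup. A sketch of the equivalent `peft` configuration, assuming the Gemma backbone is wrapped with `get_peft_model` (`lora_alpha` and dropout here are illustrative, not values from the excerpt):
```python
from peft import LoraConfig

# Mirrors the excerpt: rank-16 adapters on the attention projections.
lora_cfg = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=32,      # illustrative default; not specified in the excerpt
    lora_dropout=0.0,   # illustrative
    task_type="FEATURE_EXTRACTION",  # hidden states are pooled, not decoded
)
# get_peft_model(gemma_backbone, lora_cfg) would then wrap the text model so
# that only the adapter weights receive gradients during training.
```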
## Usage (with this repo’s LeRobot fork)
Install the dependencies and add this repository's `lerobot` directory to `PYTHONPATH`.
Evaluation-style load:
```python
import torch
from huggingface_hub import snapshot_download
from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy

# Download the exported checkpoint and load the policy in bf16 for inference.
ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")
policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
policy.eval()
```
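From there, inference presumably goes through LeRobot's standard `select_action` policy API. A hedged sketch, assuming the feature keys from the config excerpt and a language goal passed under LeRobot's `task` key (image size and normalization must match the policy's own preprocessing):
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = policy.to(device)
policy.reset()  # clear any cached action chunk between episodes

# One observation step; keys follow the config excerpt above.
batch = {
    "observation.images.ego_view": torch.rand(1, 3, 384, 384, device=device),
    "task": ["pick up the cube and place it in the bin"],  # assumed language key
}
with torch.inference_mode():
    # Returns one action; the chunk of 8 is cached and replayed internally.
    action = policy.select_action(batch)
```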
Training entrypoint:
```bash
python lerobot/lerobot/scripts/train.py \
--policy.type gemma_le \
--dataset.repo_id local/robot_sim.PickNPlace \
--dataset.root /path/to/robot_sim.PickNPlace \
--dataset.episodes "[0,1,2,3,4]" \
--batch_size 3 \
--steps 200000 \
--log_freq 100 \
--save_freq 5000 \
--policy.vision_model_id google/siglip-so400m-patch14-384 \
--policy.text_model_id google/gemma-3-4b-it \
--policy.use_amp true \
--progress_bar true \
--push_to_hub true \
--push_repo_id Ryukijano/gemma-groot \
--push_branch main \
--push_exist_ok true
```
### Slurm (3× L40)
See `submit_job.sh`. Keep the Hugging Face caches on scratch storage and set the following (a Python equivalent is sketched after this list):
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `TRANSFORMERS_CACHE` to scratch
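Outside Slurm (e.g., in a notebook), the same settings can be applied from Python, provided this runs before the first `import torch` or `transformers`; the paths below are placeholders:
```python
import os

# Must be set before CUDA is initialized for the allocator to pick it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Keep all Hugging Face caches on scratch (placeholder path).
scratch = f"/scratch/{os.environ.get('USER', 'me')}/hf_cache"
os.environ["HF_HOME"] = scratch
os.environ["HUGGINGFACE_HUB_CACHE"] = f"{scratch}/hub"
os.environ["TRANSFORMERS_CACHE"] = f"{scratch}/transformers"
```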
## Checkpoints
- The latest runs are uploaded under `runs/<date>/<run>/<step>` in this repo (see the download sketch below).
- Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.
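To fetch a single checkpoint step without downloading every run, `huggingface_hub`'s pattern filters work well (the path is taken from the example above):
```python
from huggingface_hub import snapshot_download

# Only files matching the pattern are downloaded.
ckpt_dir = snapshot_download(
    repo_id="Ryukijano/gemma-groot",
    allow_patterns=["runs/2025-08-12/13-06-07_gemma_le/020000/*"],
)
```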
## Data
- LeRobotDataset (parquet + mp4 + metadata). Single RGB view: `observation.images.ego_view`. Targets: `action`.
- Timestamp tolerance is auto-relaxed to `max(tolerance_s, 1/fps + 1e-4)` during training for robust video decoding (see the loading sketch below).
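A loading sketch under these assumptions, using LeRobot's `LeRobotDataset` (import path as in this fork) and mirroring the tolerance formula above for an assumed 30 fps dataset:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

default_tol = 1e-4
fps = 30  # assumed dataset rate
tolerance_s = max(default_tol, 1 / fps + 1e-4)  # ~0.0334 s at 30 fps

ds = LeRobotDataset(
    repo_id="local/robot_sim.PickNPlace",
    root="/path/to/robot_sim.PickNPlace",
    episodes=[0, 1, 2, 3, 4],
    tolerance_s=tolerance_s,
)
sample = ds[0]
print(sample["observation.images.ego_view"].shape, sample["action"].shape)
```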
## Notes
- Base model access: `google/gemma-3-4b-it` is gated; you may need to accept Google's terms of use on the Hugging Face Hub.
- Intended for imitation learning; ThinkAct-style planning can be layered on top.
## Citations
- LeRobot: https://github.com/huggingface/lerobot
- Gemma 3: https://ai.google.dev/gemma
- SigLIP: https://arxiv.org/abs/2303.15343
- Diffusion Policy: https://arxiv.org/abs/2303.04137