---
license: apache-2.0
language:
- en
tags:
- robotics
- vla
- lerobot
- imitation-learning
- diffusion-policy
- gemma-3
- siglip
- scaledp
- multimodal
---

# Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)

Gemma-Le is a compact Vision-Language-Action (VLA) policy for robotic manipulation built on top of LeRobot.
It replaces the NV Eagle backbone with standard Hugging Face components:

- SigLIP `google/siglip-so400m-patch14-384` for vision
- Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
- ScaleDP (Scalable Diffusion Transformer) as the action head

This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).

## Architecture
- Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
- Text: Gemma 3 4B-IT, mean-pooled hidden states
- LoRA: rank=16 on `[q_proj, k_proj, v_proj, o_proj]`
- Fusion: MLP projects [vision || text] -> `conditioning_dim=768`
- Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
- Temporal context: `chunk_size=8`; diffusion steps `num_diffusion_steps=50`
- Mixed precision: AMP auto-selects bf16/fp16; bf16 uses no GradScaler
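The fusion step above can be sketched as a small PyTorch module. This is an illustrative sketch, not the repo's actual class: the module name is hypothetical, and the 1152-d (SigLIP-SO400M) and 2560-d (Gemma 3 4B) input widths are assumed pooled-embedding sizes; only `conditioning_dim=768` comes from the config.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Projects concatenated [vision || text] embeddings to the conditioning vector.

    Illustrative sketch: input widths are assumed (SigLIP-SO400M ~1152-d,
    Gemma 3 4B ~2560-d); conditioning_dim=768 follows the config excerpt.
    """
    def __init__(self, vision_dim=1152, text_dim=2560, conditioning_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim + text_dim, conditioning_dim),
            nn.GELU(),
            nn.Linear(conditioning_dim, conditioning_dim),
        )

    def forward(self, vision_emb, text_emb):
        # Concatenate pooled vision and text embeddings, then project.
        return self.mlp(torch.cat([vision_emb, text_emb], dim=-1))

fusion = FusionMLP()
cond = fusion(torch.randn(2, 1152), torch.randn(2, 2560))
print(cond.shape)  # torch.Size([2, 768])
```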

## Default config (excerpt)
```yaml
vision_model_id: google/siglip-so400m-patch14-384
text_model_id:   google/gemma-3-4b-it
image_features:  ["observation.images.ego_view"]
action_feature:  "action"
chunk_size: 8
num_diffusion_steps: 50
conditioning_dim: 768
plan_update_interval: 10
scaledp_num_layers: 12
scaledp_dim_model: 320
scaledp_num_heads: 8
scaledp_dim_feedforward: 1280
use_lora: true
lora_rank: 16
lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
optimizer_lr: 1e-4
optimizer_weight_decay: 1e-6
```
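The `chunk_size` and `num_diffusion_steps` values govern the denoising objective. A minimal sketch of a DDPM-style noise-prediction loss under those settings (the actual ScaleDP head and its noise schedule may differ; `action_dim` and the beta range here are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative DDPM-style objective: chunk_size=8 and num_diffusion_steps=50
# follow the config; the linear beta schedule and action_dim are assumptions.
chunk_size, action_dim, num_steps = 8, 7, 50

betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, actions, cond):
    """actions: (B, chunk_size, action_dim); cond: (B, conditioning_dim)."""
    b = actions.shape[0]
    t = torch.randint(0, num_steps, (b,))           # random timestep per sample
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, cond)                 # head predicts the added noise
    return F.mse_loss(pred, noise)
```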

## Usage (with this repo’s LeRobot fork)
Install the dependencies and set `PYTHONPATH` to include the `lerobot` directory in this repository.

Evaluation-style load:
```python
import torch
from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy
from huggingface_hub import snapshot_download
ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")
policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
policy.eval()
```
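Once loaded, the policy can be stepped once per control tick. A hedged sketch assuming LeRobot's standard `select_action` interface; the 384x384 resolution matches SigLIP's input size, while the 7-dim `observation.state` is illustrative:

```python
import torch

def run_policy_step(policy, device="cpu"):
    """One control tick: build an observation batch and query the policy.

    Hedged sketch assuming LeRobot's `select_action` interface; the state
    dimension is illustrative and depends on the robot.
    """
    obs = {
        "observation.images.ego_view": torch.zeros(1, 3, 384, 384, device=device),
        "observation.state": torch.zeros(1, 7, device=device),
    }
    with torch.no_grad():
        # Returns the next action; the policy manages its 8-step chunk internally.
        return policy.select_action(obs)
```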

Training entrypoint:
```bash
python lerobot/lerobot/scripts/train.py \
  --policy.type gemma_le \
  --dataset.repo_id local/robot_sim.PickNPlace \
  --dataset.root /path/to/robot_sim.PickNPlace \
  --dataset.episodes "[0,1,2,3,4]" \
  --batch_size 3 \
  --steps 200000 \
  --log_freq 100 \
  --save_freq 5000 \
  --policy.vision_model_id google/siglip-so400m-patch14-384 \
  --policy.text_model_id google/gemma-3-4b-it \
  --policy.use_amp true \
  --progress_bar true \
  --push_to_hub true \
  --push_repo_id Ryukijano/gemma-groot \
  --push_branch main \
  --push_exist_ok true
```

### Slurm (3× L40)
See `submit_job.sh`. Ensure caches live on scratch storage and set:
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `TRANSFORMERS_CACHE` to scratch
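The environment setup above can be sketched as the following exports; the scratch paths are placeholders to adapt to your cluster:

```shell
# Avoid CUDA allocator fragmentation on long runs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Point all Hugging Face caches at scratch storage (adjust the path).
export HF_HOME="/scratch/$USER/huggingface"
export HUGGINGFACE_HUB_CACHE="$HF_HOME/hub"
export TRANSFORMERS_CACHE="$HF_HOME/transformers"
```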

## Checkpoints
- Latest runs uploaded under `runs/<date>/<run>/<step>` in this repo.
- Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.

## Data
- LeRobotDataset (parquet + mp4 + metadata). Single RGB view: `observation.images.ego_view`. Targets: `action`.
- Timestamp tolerance is auto-relaxed to `max(tolerance_s, 1/fps + 1e-4)` during training for robust decoding.
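The relaxation rule above can be written directly; a one-line sketch (the function name is illustrative):

```python
def relaxed_tolerance(tolerance_s: float, fps: float) -> float:
    """Relax timestamp matching to at least one frame period plus slack."""
    return max(tolerance_s, 1.0 / fps + 1e-4)

# At 30 fps the floor is one frame period (~33.4 ms), even if tolerance_s is tiny.
```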

## Notes
- Base model access: `google/gemma-3-4b-it` is gated and may require accepting its terms of use on the Hub.
- Intended for imitation learning; ThinkAct-style planning can be layered on top.

## Citations
- LeRobot: https://github.com/huggingface/lerobot
- Gemma 3: https://ai.google.dev/gemma
- SigLIP: https://arxiv.org/abs/2303.15343
- Diffusion Policy: https://arxiv.org/abs/2303.04137