Ryukijano committed on
Commit 8315a7e · verified · 1 Parent(s): d1fa17c

Update README.md

Files changed (1):
  1. README.md +96 -317
README.md CHANGED
@@ -1,335 +1,114 @@
  ---
  language:
  - en
- license: mit
- library_name: transformers
- # Using text-to-video as the pipeline tag since the model generates action sequences from vision and language inputs
- pipeline_tag: text-to-video
- datasets:
- - lerobot/robot_sim.PickNPlace
- - lerobot/so100_strawberry_grape
- base_model: NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops
  tags:
  - robotics
- - vision-language-action
- - reinforcement-learning
  - imitation-learning
- - nvidia
- - gr00t
- - gemma
  - diffusion-policy
- - lerobot
- - robot-learning
- - embodied-ai
- - humanoid-robots
- - robot-manipulation
- - computer-vision
- - natural-language-processing
- - deep-learning
- - transformer
- - vision-transformer
- - flow-matching
- - foundation-model
- - multi-modal
- - human-robot-interaction
- - autonomous-robots
- - robot-control
- - robot-perception
- - robot-vision
  ---
 
- # Gemma-GR00T: A Vision-Language-Action Model for Robotic Control
-
- This is a fine-tuned version of the NVIDIA GR00T N1.5 model, adapted for robotic control tasks using the LeRobot framework. The model combines vision, language, and action generation capabilities to enable robots to perform complex manipulation tasks based on natural language instructions.
-
- ## Model Description
-
- Gemma-GR00T is a state-of-the-art multimodal vision-language-action policy that combines Google's Gemma language model with NVIDIA's GR00T robotics framework. This model is specifically designed for advanced robotic manipulation tasks, enabling robots to understand natural language instructions, perceive their environment through vision, and perform precise manipulation actions.
-
- ## Model Details
-
- - **Model type:** Vision-Language-Action (VLA) model
- - **Base Model:** [NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops](https://huggingface.co/NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops)
- - **Task:** text-to-video (robot action generation from vision and language)
- - **Training Data:** Trained on LeRobot datasets using the `fourier_gr1_arms_only` configuration
- - **Framework:** PyTorch with Hugging Face Transformers
- - **Related Models:** [NVIDIA GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B), [LeRobot Models](https://huggingface.co/lerobot)
- - **Related Datasets:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)
-
- ### Model Architecture
-
- The model is built on a sophisticated multimodal architecture that combines state-of-the-art vision and language models for robotic control:
-
- 1. **Backbone**: `Eagle2_5_VLForConditionalGeneration`
-    - A powerful vision-language model that processes both visual and textual inputs
-    - Integrates vision and language representations for multimodal understanding
-
- 2. **Text Encoder**: `Qwen3-1.7B`
-    - Base Model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
-    - Type: Causal Language Model
-    - Parameters: 1.7B
-    - Layers: 28
-    - Attention: 16 heads for Q, 8 heads for KV (GQA)
-    - Context Length: 32,768 tokens
-    - Features:
-      - Strong reasoning and instruction-following capabilities
-      - Optimized for long-context understanding
-      - Supports complex language understanding and generation
-
- 3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
-    - Base Model: [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224)
-    - Type: Vision Transformer (ViT)
-    - Patch Size: 16x16
-    - Image Size: 224x224
-    - Hidden Size: 768
-    - Layers: 12
-    - Attention Heads: 12
-    - Features:
-      - Strong visual representation learning
-      - Excellent zero-shot classification capabilities
-      - Robust to various visual domains
-
- 4. **Action Head**: Diffusion-based Policy
-    - Type: Flow-matching action head
-    - Architecture: 4-layer transformer (ScaledDP)
-    - Hidden Size: 512
-    - Feed-Forward Size: 2,048
-    - Attention Heads: 8
-    - Features:
-      - Generates smooth, continuous actions for robotic control
-      - Uses diffusion process for action generation
-
- ## Training & Evaluation
-
- ### Training Performance
-
- - **Total Training Steps**: 30,000
- - **Final Epoch**: 114.5
- - **Initial Loss**: 1.27
- - **Final Loss**: 0.11
- - **Learning Rate**: Warmup to 1e-5 with gradual decay
- - **Gradient Norm**: Stabilized around 0.3-1.0 (initial: 11.1)
-
- ### Recommended Evaluation Metrics
-
- #### Task Performance
- - **Success Rate**: Percentage of successful task completions
- - **Path Length**: Efficiency of movement (shorter paths are better)
- - **Smoothness**: L2 norm of action derivatives (lower is smoother)
- - **Goal Distance**: Final distance to target position
- - **Success Rate at k (SR@k)**: Success rate within k attempts
-
- #### Model Accuracy
- - **Action MSE**: Mean squared error of predicted vs. ground truth actions
- - **Per-Joint Position Error**: Error for each degree of freedom
- - **Gripper Accuracy**: Binary classification of gripper state
- - **Trajectory Error**: Dynamic Time Warping (DTW) distance from reference
-
- #### System Efficiency
- - **Inference Time**: Per-step latency (ms)
- - **Memory Usage**: Peak GPU memory consumption (GB)
- - **FLOPS**: Computational requirements
- - **Throughput**: Steps/second during inference
-
- #### Robustness
- - **Success Rate under Noise**: Performance with added sensor noise
- - **Generalization**: Performance on unseen objects/scenes
- - **Failure Mode Analysis**: Categorization of common failures
- - **Recovery Rate**: Ability to recover from perturbations
-
- ### Evaluation Protocol
-
- 1. **Test Environments**
-    - Fixed initial conditions
-    - Multiple random seeds (recommended: 5+)
-    - Human baseline comparison
-    - Ablation studies
-
- 2. **Visualization**
-    - Trajectory plots (ground truth vs predicted)
-    - Attention heatmaps
-    - Failure case analysis
-    - Action distribution plots
-
- 3. **Reporting**
-    - Mean and standard deviation across seeds
-    - Statistical significance testing
-    - Compute requirements (GPU hours, memory)
-    - Hyperparameter sensitivity analysis
-    - Processes both visual and language conditioning
-
- 5. **Training Configuration**:
-    - Optimizer: AdamW (lr=1e-4, weight_decay=1e-6)
-    - Diffusion Steps: 100
-    - Chunk Size: 16
-    - Action Steps: 8
-    - Observation Steps: 1
-
- The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, then fuses these representations in the Eagle2.5 backbone to generate precise control actions via the diffusion-based policy head. The architecture is specifically designed for real-time robotic control with low-latency inference.
169
-
170
- ## Uses
171
-
172
- ### Direct Use
173
-
174
- This model is part of the [Gemma-GR00T](https://github.com/Ryukijano/Gemma-Grook) project and is designed for research and development of robotic manipulation systems. It can be used for:
175
-
176
- - Robotic arm manipulation tasks (pick-and-place, assembly, etc.)
177
- - Sim-to-real transfer learning in robotics
178
- - Multimodal robotic control with natural language instructions
179
- - Research in reinforcement and imitation learning for robotics
180
- - Integration with the [LeRobot](https://github.com/huggingface/lerobot) ecosystem
181
-
182
- ### Related Projects
183
-
184
- - [LeRobot](https://github.com/huggingface/lerobot): The base framework used for training
185
- - [GR00T](https://developer.nvidia.com/gr00t): NVIDIA's foundation model for humanoid robots
186
- - [Gemma](https://huggingface.co/google/gemma-7b): The language model backbone
187
-
188
- ### Out-of-Scope Use
189
-
190
- This model is not intended for:
191
- - Critical systems where failure could lead to harm
192
- - Applications without proper safety measures
193
- - Real-time control without thorough testing
194
- - Non-robotic applications
195
-
196
- ## How to Use
197
-
198
- ### Installation
199
-
200
- ```bash
201
- pip install -r requirements.txt
202
  ```
203
 
204
- ### Loading the Model
 
205
 
 
206
  ```python
207
- from transformers import AutoModelForCausalLM, AutoConfig
208
-
209
- # Load the model
210
- model = AutoModelForCausalLM.from_pretrained("path/to/exported_weights")
211
- ```
212
-
213
- ### Inference Example
214
-
215
- ```python
216
- # Example code for running inference with the model
217
  import torch
218
-
219
- def run_inference(observation, language_instruction):
220
- # Preprocess observation and instruction
221
- inputs = preprocess(observation, language_instruction)
222
-
223
- # Run model inference
224
- with torch.no_grad():
225
- actions = model(**inputs)
226
-
227
- return actions
228
  ```
229
 
230
- ## Training Details
231
-
232
- ### Training Data
233
-
234
- This model was trained using the [LeRobot](https://github.com/huggingface/lerobot) framework, which provides standardized datasets and tools for robotic learning. The training utilized the following configuration:
235
-
236
- - **Primary Datasets:**
237
- - `lerobot/robot_sim.PickNPlace`: Simulated pick and place tasks
238
- - `lerobot/so100_strawberry_grape`: Real-world manipulation tasks
239
- - **Data Configuration:** `fourier_gr1_arms_only`
240
- - **Dataset Documentation:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)
241
- - **Data Processing:** Follows LeRobot's standardized data pipeline for consistency with other models in the ecosystem
242
- - **Environment:** [Isaac Sim](https://developer.nvidia.com/isaac-sim)
243
- - **Training Steps:** 30,000
244
- - **Batch Size:** 32
245
- - **Learning Rate:** 1e-4
246
- - **Optimizer:** AdamW
247
- - **Weight Decay:** 1e-5
248
- - **Warmup Ratio:** 0.05
249
- - **Hardware:** 3× NVIDIA L40S GPUs
250
- - **Framework:** PyTorch with Hugging Face Transformers
251
-
252
- ### Data Processing
253
-
254
- The model processes the following modalities from the LeRobot dataset:
255
- - **Visual Inputs:** Processed through a vision encoder
256
- - **Proprioception:** Arm joint states and gripper status
257
- - **Actions:** 32-dimensional continuous action space
258
- - **Language Instructions:** Natural language task descriptions
259
-
260
- ### Training Procedure
261
-
262
- The model was trained using a combination of:
263
- - Imitation learning from demonstration data
264
- - Reinforcement learning with PPO
265
- - Behavior cloning
266
-
267
- ## Evaluation
268
-
269
- ### Metrics
270
-
271
- - **Success Rate:** 85% on validation tasks
272
- - **Task Completion:** 90% of tasks completed successfully
273
- - **Generalization:** 75% success on unseen objects
274
-
275
- ### Results
276
-
277
- | Task | Success Rate |
278
- |------|-------------:|
279
- | Pick and Place | 88% |
280
- | Object Stacking | 83% |
281
- | Tool Use | 79% |
282
- | Multi-step Tasks | 72% |
283
-
284
- ## Limitations and Bias
285
-
286
- - The model's performance is highly dependent on the quality and diversity of the training data.
287
- - May not generalize well to completely novel objects or environments.
288
- - Performance may degrade in cluttered or highly dynamic environments.
289
- - Safety mechanisms should be implemented for real-world deployment.
290
-
291
- ## Environmental Impact
292
-
293
- - **Carbon Emissions:** Estimated 120 kg CO2eq
294
- - **Hardware Type:** NVIDIA L40S GPUs
295
- - **Hours used:** 240
296
- - **Cloud Provider:** Private cluster
297
- - **Compute Region:** UK
298
- - **Energy Mix:** 40% renewable
299
-
300
- ## Technical Specifications
301
-
302
- ### Model Architecture
303
-
304
- - **Parameters:** 1.7B
305
- - **Layers:** 16
306
- - **Attention Heads:** 32
307
- - **Hidden Size:** 2048
308
- - **Context Length:** 2048 tokens
309
-
310
- ### Hardware and Software
311
-
312
- - **Training Hardware:** 3× NVIDIA L40S GPUs
313
- - **Inference Hardware:** NVIDIA L4 or better
314
- - **Framework:** PyTorch 2.7.1+
315
- - **CUDA Version:** 12.4
316
-
317
- ## Citation
318
-
319
- ```bibtex
320
- @misc{gemmagroot2024,
321
- title={Gemma-GR00T: Multimodal Robotic Manipulation with Language Models},
322
- author={Your Name},
323
- year={2024},
324
- publisher={GitHub},
325
- howpublished={\url{https://github.com/Ryukijano/Gemma-Grook}},
326
- }
327
  ```
328
 
329
- ## Model Card Contact
330
-
331
- For questions or comments about this model, please open an issue in the [GitHub repository](https://github.com/Ryukijano/Gemma-Grook/issues).
332
-
333
- ## License
334
-
335
- This model is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  ---
+ license: apache-2.0
  language:
  - en
  tags:
  - robotics
+ - vla
+ - lerobot
  - imitation-learning
  - diffusion-policy
+ - gemma-3
+ - siglip
+ - scaledp
+ - multimodal
  ---
 
+ # Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)
+
+ Gemma-Le is a compact Vision-Language-Action (VLA) policy for robotic manipulation built on top of LeRobot.
+ It replaces the NV Eagle backbone with standard Hugging Face components:
+
+ - SigLIP `google/siglip-so400m-patch14-384` for vision
+ - Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
+ - ScaleDP (Scalable Diffusion Transformer) as the action head
+
+ This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).
+
+ ## Architecture
+ - Vision: SigLIP ViT encoder (384 px, patch 14), pooled embedding
+ - Text: Gemma 3 4B-IT, mean-pooled hidden states
+ - LoRA: rank=16 on `[q_proj, k_proj, v_proj, o_proj]`
+ - Fusion: an MLP projects the concatenated [vision || text] embedding to `conditioning_dim=768` (see the sketch below)
+ - Action head: ScaleDP transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
+ - Temporal context: `chunk_size=8`; diffusion steps: `num_diffusion_steps=50`
+ - Mixed precision: AMP auto-selects bf16/fp16; bf16 runs without a GradScaler
+
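To make the fusion step concrete, here is a minimal, self-contained sketch of the conditioning path described above: a pooled SigLIP embedding and mean-pooled Gemma hidden states are concatenated and projected to a 768-d conditioning vector. The embedding dimensions and module layout are illustrative assumptions, not the fork's actual implementation.

```python
import torch
import torch.nn as nn

class FusionConditioner(nn.Module):
    """Illustrative fusion MLP: [vision || text] -> conditioning_dim (dims are assumed)."""

    def __init__(self, vision_dim=1152, text_dim=2560, conditioning_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim + text_dim, conditioning_dim),
            nn.GELU(),
            nn.Linear(conditioning_dim, conditioning_dim),
        )

    def forward(self, vision_pooled, text_hidden, text_mask):
        # Mean-pool the language hidden states over valid (non-padded) tokens.
        mask = text_mask.unsqueeze(-1).float()
        text_pooled = (text_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        # Concatenate modalities and project to the conditioning vector.
        return self.mlp(torch.cat([vision_pooled, text_pooled], dim=-1))

# Stand-in tensors in place of real SigLIP / Gemma outputs.
vision_pooled = torch.randn(2, 1152)             # pooled SigLIP embedding
text_hidden = torch.randn(2, 16, 2560)           # language hidden states (batch, seq, dim)
text_mask = torch.ones(2, 16, dtype=torch.bool)  # attention mask
cond = FusionConditioner()(vision_pooled, text_hidden, text_mask)
print(cond.shape)  # torch.Size([2, 768])
```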
+ ## Default config (excerpt)
+ ```yaml
+ vision_model_id: google/siglip-so400m-patch14-384
+ text_model_id: google/gemma-3-4b-it
+ image_features: ["observation.images.ego_view"]
+ action_feature: "action"
+ chunk_size: 8
+ num_diffusion_steps: 50
+ conditioning_dim: 768
+ plan_update_interval: 10
+ scaledp_num_layers: 12
+ scaledp_dim_model: 320
+ scaledp_num_heads: 8
+ scaledp_dim_feedforward: 1280
+ use_lora: true
+ lora_rank: 16
+ lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
+ optimizer_lr: 1e-4
+ optimizer_weight_decay: 1e-6
  ```
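As a rough mental model of how the `scaledp_*` dimensions and `chunk_size` fit together, the sketch below wires a plain PyTorch transformer encoder with those sizes as a noise predictor over an 8-step action chunk. It is a simplified stand-in (diffusion timestep embedding omitted; the 32-dim action space is a hypothetical example), not the ScaleDP implementation in this repo.

```python
import torch
import torch.nn as nn

CHUNK_SIZE, ACTION_DIM, COND_DIM = 8, 32, 768  # ACTION_DIM is a hypothetical example

class TinyDenoiser(nn.Module):
    """Simplified noise predictor using the scaledp_* dimensions from the config above."""

    def __init__(self, d_model=320, heads=8, ff=1280, layers=12):
        super().__init__()
        self.in_proj = nn.Linear(ACTION_DIM, d_model)
        self.cond_proj = nn.Linear(COND_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.out_proj = nn.Linear(d_model, ACTION_DIM)

    def forward(self, noisy_actions, conditioning):
        # Add the conditioning vector to every timestep token of the action chunk.
        x = self.in_proj(noisy_actions) + self.cond_proj(conditioning).unsqueeze(1)
        # Predicted noise has the same shape as the noisy action chunk.
        return self.out_proj(self.encoder(x))

noisy = torch.randn(2, CHUNK_SIZE, ACTION_DIM)  # noised action chunk at one of the 50 diffusion levels
cond = torch.randn(2, COND_DIM)                 # fused vision+text conditioning
print(TinyDenoiser()(noisy, cond).shape)        # torch.Size([2, 8, 32])
```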
 
+ ## Usage (with this repo’s LeRobot fork)
+ Install dependencies and set `PYTHONPATH` to include the `lerobot` directory in this repository.
+
+ Evaluation-style load:
  ```python
  import torch
+ from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy
+ from huggingface_hub import snapshot_download
+ ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")
+ policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
+ policy.eval()
  ```
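Once the policy is loaded, a single control step could look roughly like the following. The observation key comes from the config above (`observation.images.ego_view`) and the `select_action`/`reset` calls follow LeRobot's usual policy interface; the `task` key, image normalization, and dtype handling are assumptions to verify against the fork's policy code.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
policy.reset()  # clear any cached action chunk between episodes

# Dummy observation standing in for a real camera frame and instruction.
batch = {
    "observation.images.ego_view": torch.rand(1, 3, 384, 384, device=device),  # RGB in [0, 1]
    "task": ["pick up the pear and place it in the basket"],
}

with torch.no_grad():
    # The policy handles preprocessing/casting internally (assumed); returns the action for the current step.
    action = policy.select_action(batch)
print(action.shape)
```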
+ Training entrypoint:
+ ```bash
+ python lerobot/lerobot/scripts/train.py \
+     --policy.type gemma_le \
+     --dataset.repo_id local/robot_sim.PickNPlace \
+     --dataset.root /path/to/robot_sim.PickNPlace \
+     --dataset.episodes "[0,1,2,3,4]" \
+     --batch_size 3 \
+     --steps 200000 \
+     --log_freq 100 \
+     --save_freq 5000 \
+     --policy.vision_model_id google/siglip-so400m-patch14-384 \
+     --policy.text_model_id google/gemma-3-4b-it \
+     --policy.use_amp true \
+     --progress_bar true \
+     --push_to_hub true \
+     --push_repo_id Ryukijano/gemma-groot \
+     --push_branch main \
+     --push_exist_ok true
  ```
 
+ ### Slurm (3× L40)
+ See `submit_job.sh`. Ensure caches live on scratch and set:
+ - `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
+ - `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, and `TRANSFORMERS_CACHE` to scratch paths
+
+ ## Checkpoints
+ - The latest runs are uploaded under `runs/<date>/<run>/<step>` in this repo.
+ - Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.
+
+ ## Data
+ - LeRobotDataset format (parquet + mp4 + metadata). Single RGB view: `observation.images.ego_view`. Targets: `action`.
+ - The timestamp tolerance is auto-relaxed to `max(tolerance_s, 1/fps + 1e-4)` during training for robust video decoding.
+
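For concreteness, the tolerance rule above can be applied when constructing the dataset. The sketch below assumes upstream LeRobot's `LeRobotDataset` constructor; the import path and the `tolerance_s` keyword should be checked against this fork, and the `fps` value is a placeholder (read it from the dataset metadata in practice).

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

fps = 20                                 # placeholder; use the dataset's actual fps
tolerance_s = max(1e-4, 1 / fps + 1e-4)  # relaxed tolerance rule described above

dataset = LeRobotDataset(
    repo_id="local/robot_sim.PickNPlace",
    root="/path/to/robot_sim.PickNPlace",
    episodes=[0, 1, 2, 3, 4],
    tolerance_s=tolerance_s,
)
sample = dataset[0]
print(sample["observation.images.ego_view"].shape, sample["action"].shape)
```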
+ ## Notes
+ - Base model access: `google/gemma-3-4b-it` is gated and may require accepting Google's terms of use on the Hub.
+ - Intended for imitation learning; ThinkAct-style planning can be layered on top.
+
+ ## Citations
+ - LeRobot: https://github.com/huggingface/lerobot
+ - Gemma 3: https://ai.google.dev/gemma
+ - SigLIP: https://huggingface.co/google/siglip-so400m-patch14-384
+ - Diffusion Policy: https://arxiv.org/abs/2303.04137