Ryukijano committed
Commit 7ec1b17 · verified · 1 Parent(s): ed6cb09

Add Gemma-GR00T model weights

Files changed (1):
  1. README.md (+121 −28)

README.md CHANGED
@@ -3,29 +3,44 @@ language:
 - en
 license: mit
 library_name: transformers
- pipeline_tag: reinforcement-learning
 datasets:
 - lerobot/robot_sim.PickNPlace
 - lerobot/so100_strawberry_grape
 base_model: NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops
 tags:
 - robotics
 - reinforcement-learning
 - imitation-learning
- - gemma
- - gr00t
 - nvidia
- - lerobot
- - vision-language-action
- - robot-manipulation
- - gemma-le
 - diffusion-policy
- - le-robot
 - robot-learning
 - embodied-ai
 ---

- # Gemma-GR00T: Multimodal Robotic Manipulation with Language Models

 ## Model Description

@@ -33,46 +48,124 @@ Gemma-GR00T is a state-of-the-art multimodal vision-language-action policy that

 ## Model Details

- - **Developed by:** [Gyanateet Dutta](https://huggingface.co/Ryukijano)
- - **Model type:** Vision-Language-Action Policy
- - **Language(s) (NLP):** English
- - **License:** MIT
- - **Finetuned from model:** [NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops](https://huggingface.co/NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops)
 - **Training Data:** Trained on LeRobot datasets using the `fourier_gr1_arms_only` configuration
 - **Framework:** PyTorch with Hugging Face Transformers
- - **Related Models:** [LeRobot Models](https://huggingface.co/lerobot)
 - **Related Datasets:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)

 ### Model Architecture

 The model is built on a sophisticated multimodal architecture that combines state-of-the-art vision and language models for robotic control:

- 1. **Backbone**: `Eagle2_5_VLForConditionalGeneration` - A powerful vision-language model that processes both visual and textual inputs.

 2. **Text Encoder**: `Qwen3-1.7B`
    - Type: Causal Language Model
    - Parameters: 1.7B
    - Layers: 28
    - Attention: 16 heads for Q, 8 heads for KV (GQA)
    - Context Length: 32,768 tokens
-    - Features: Strong reasoning and instruction-following capabilities

 3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
    - Type: Vision Transformer (ViT)
-    - Patch Size: 14x14
-    - Hidden Size: 1,152
-    - Layers: 27
-    - Attention Heads: 16
-    - Features: Strong visual representation learning with localization capabilities

 4. **Action Head**: Diffusion-based Policy
    - Type: Flow-matching action head
-    - Architecture: 16-layer transformer
-    - Hidden Size: 1,024
-    - Attention Heads: 32
-    - Features: Generates smooth, continuous actions for robotic control
-
- The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, then fuses these representations in the Eagle2.5 backbone to generate precise control actions via the diffusion-based policy head.

 ## Uses
 
@@ -3,29 +3,44 @@ language:
 - en
 license: mit
 library_name: transformers
+ # Using the robotics pipeline tag, since the model maps vision and language inputs to robot action sequences
+ pipeline_tag: robotics
 datasets:
 - lerobot/robot_sim.PickNPlace
 - lerobot/so100_strawberry_grape
 base_model: NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops
 tags:
 - robotics
+ - vision-language-action
 - reinforcement-learning
 - imitation-learning
 - nvidia
+ - gr00t
+ - gemma
 - diffusion-policy
+ - lerobot
 - robot-learning
 - embodied-ai
+ - humanoid-robots
+ - robot-manipulation
+ - computer-vision
+ - natural-language-processing
+ - deep-learning
+ - transformer
+ - vision-transformer
+ - flow-matching
+ - foundation-model
+ - multi-modal
+ - human-robot-interaction
+ - autonomous-robots
+ - robot-control
+ - robot-perception
+ - robot-vision
 ---

+ # Gemma-GR00T: A Vision-Language-Action Model for Robotic Control
+
+ This is a fine-tuned version of the NVIDIA GR00T N1.5 model, adapted for robotic control tasks with the LeRobot framework. The model combines vision, language, and action generation to let robots perform complex manipulation tasks from natural language instructions.

 ## Model Description

@@ -33,46 +48,124 @@ Gemma-GR00T is a state-of-the-art multimodal vision-language-action policy that

 ## Model Details

+ - **Model type:** Vision-Language-Action (VLA) model
+ - **Base Model:** [NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops](https://huggingface.co/NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops)
+ - **Task:** robotics (robot action generation from vision and language inputs)
 - **Training Data:** Trained on LeRobot datasets using the `fourier_gr1_arms_only` configuration
 - **Framework:** PyTorch with Hugging Face Transformers
+ - **Related Models:** [NVIDIA GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B), [LeRobot Models](https://huggingface.co/lerobot)
 - **Related Datasets:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)
 
59
  ### Model Architecture
60
 
61
  The model is built on a sophisticated multimodal architecture that combines state-of-the-art vision and language models for robotic control:
62
 
63
+ 1. **Backbone**: `Eagle2_5_VLForConditionalGeneration`
64
+ - A powerful vision-language model that processes both visual and textual inputs
65
+ - Integrates vision and language representations for multimodal understanding
66
 
67
  2. **Text Encoder**: `Qwen3-1.7B`
68
+ - Base Model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
69
  - Type: Causal Language Model
70
  - Parameters: 1.7B
71
  - Layers: 28
72
  - Attention: 16 heads for Q, 8 heads for KV (GQA)
73
  - Context Length: 32,768 tokens
74
+ - Features:
75
+ - Strong reasoning and instruction-following capabilities
76
+ - Optimized for long-context understanding
77
+ - Supports complex language understanding and generation
78
 
79
  3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
80
+ - Base Model: [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224)
81
  - Type: Vision Transformer (ViT)
82
+ - Patch Size: 16x16
83
+ - Image Size: 224x224
84
+ - Hidden Size: 768
85
+ - Layers: 12
86
+ - Attention Heads: 12
87
+ - Features:
88
+ - Strong visual representation learning
89
+ - Excellent zero-shot classification capabilities
90
+ - Robust to various visual domains
91
 
92
  4. **Action Head**: Diffusion-based Policy
93
  - Type: Flow-matching action head
94
+ - Architecture: 4-layer transformer (ScaledDP)
95
+ - Hidden Size: 512
96
+ - Feed-Forward Size: 2,048
97
+ - Attention Heads: 8
98
+ - Features:
99
+ - Generates smooth, continuous actions for robotic control
100
+ - Uses diffusion process for action generation
101
+
102
+ ## Training & Evaluation
103
+
104
+ ### Training Performance
105
+
106
+ - **Total Training Steps**: 30,000
107
+ - **Final Epoch**: 114.5
108
+ - **Initial Loss**: 1.27
109
+ - **Final Loss**: 0.11
110
+ - **Learning Rate**: Warmup to 1e-5 with gradual decay
111
+ - **Gradient Norm**: Stabilized around 0.3-1.0 (initial: 11.1)
112
+
113
+ ### Recommended Evaluation Metrics
114
+
115
+ #### Task Performance
116
+ - **Success Rate**: Percentage of successful task completions
117
+ - **Path Length**: Efficiency of movement (shorter paths are better)
118
+ - **Smoothness**: L2 norm of action derivatives (lower is smoother)
119
+ - **Goal Distance**: Final distance to target position
120
+ - **Success Rate at k (SR@k)**: Success rate within k attempts
121
+
122
+ #### Model Accuracy
123
+ - **Action MSE**: Mean squared error of predicted vs. ground truth actions
124
+ - **Per-Joint Position Error**: Error for each degree of freedom
125
+ - **Gripper Accuracy**: Binary classification of gripper state
126
+ - **Trajectory Error**: Dynamic Time Warping (DTW) distance from reference
127
+
128
+ #### System Efficiency
129
+ - **Inference Time**: Per-step latency (ms)
130
+ - **Memory Usage**: Peak GPU memory consumption (GB)
131
+ - **FLOPS**: Computational requirements
132
+ - **Throughput**: Steps/second during inference
133
+
134
+ #### Robustness
135
+ - **Success Rate under Noise**: Performance with added sensor noise
136
+ - **Generalization**: Performance on unseen objects/scenes
137
+ - **Failure Mode Analysis**: Categorization of common failures
138
+ - **Recovery Rate**: Ability to recover from perturbations
139
+
140
+ ### Evaluation Protocol
141
+
142
+ 1. **Test Environments**
143
+ - Fixed initial conditions
144
+ - Multiple random seeds (recommended: 5+)
145
+ - Human baseline comparison
146
+ - Ablation studies
147
+
148
+ 2. **Visualization**
149
+ - Trajectory plots (ground truth vs predicted)
150
+ - Attention heatmaps
151
+ - Failure case analysis
152
+ - Action distribution plots
153
+
154
+ 3. **Reporting**
155
+ - Mean and standard deviation across seeds
156
+ - Statistical significance testing
157
+ - Compute requirements (GPU hours, memory)
158
+ - Hyperparameter sensitivity analysis
159
+ - Processes both visual and language conditioning
160
+
161
+ 5. **Training Configuration**:
162
+ - Optimizer: AdamW (lr=1e-4, weight_decay=1e-6)
163
+ - Diffusion Steps: 100
164
+ - Chunk Size: 16
165
+ - Action Steps: 8
166
+ - Observation Steps: 1
167
+
168
+ The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, then fuses these representations in the Eagle2.5 backbone to generate precise control actions via the diffusion-based policy head. The architecture is specifically designed for real-time robotic control with low-latency inference.
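+
+ To connect the configuration above (chunk size 16, action steps 8, one observation step) to deployment, here is a hedged sketch of a receding-horizon control loop. The `policy` callable, the observation keys, and the gymnasium-style `env` are assumptions for illustration; the actual loading and inference API comes from the GR00T / LeRobot tooling rather than from this card.
+
+ ```python
+ import torch
+
+ def control_loop(policy, env, instruction, chunk_size=16, action_steps=8):
+     """Receding-horizon execution: predict a chunk of 16 actions, execute the first 8, re-plan."""
+     obs, _ = env.reset()
+     done = False
+     while not done:
+         with torch.no_grad():
+             # One observation step in, chunk_size future actions out: (chunk_size, action_dim)
+             action_chunk = policy(images=obs["images"], instruction=instruction)
+         assert len(action_chunk) == chunk_size
+         for action in action_chunk[:action_steps]:
+             obs, reward, terminated, truncated, info = env.step(action)
+             if terminated or truncated:
+                 done = True
+                 break
+ ```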
 
 ## Uses