Add Gemma-GR00T model weights
README.md

---
language:
- en
license: mit
library_name: transformers
# Using text-to-video as the pipeline tag since the model generates action sequences from vision and language inputs
pipeline_tag: text-to-video
datasets:
- lerobot/robot_sim.PickNPlace
- lerobot/so100_strawberry_grape
base_model: NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops
tags:
- robotics
- vision-language-action
- reinforcement-learning
- imitation-learning
- nvidia
- gr00t
- gemma
- diffusion-policy
- lerobot
- robot-learning
- embodied-ai
- humanoid-robots
- robot-manipulation
- computer-vision
- natural-language-processing
- deep-learning
- transformer
- vision-transformer
- flow-matching
- foundation-model
- multi-modal
- human-robot-interaction
- autonomous-robots
- robot-control
- robot-perception
- robot-vision
---

# Gemma-GR00T: A Vision-Language-Action Model for Robotic Control

This is a fine-tuned version of the NVIDIA GR00T N1.5 model, adapted for robotic control tasks using the LeRobot framework. The model combines vision, language, and action generation capabilities to enable robots to perform complex manipulation tasks based on natural language instructions.

## Model Description

Gemma-GR00T is a state-of-the-art multimodal vision-language-action policy that …

## Model Details

- **Model type:** Vision-Language-Action (VLA) model
- **Base Model:** [NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops](https://huggingface.co/NVEagle/eagle_er-qwen3_1_7B-Siglip2_400M_stage1_5_128gpu_er_v7_1mlp_nops)
- **Task:** text-to-video (robot action generation from vision and language)
- **Training Data:** Trained on LeRobot datasets using the `fourier_gr1_arms_only` configuration
- **Framework:** PyTorch with Hugging Face Transformers
- **Related Models:** [NVIDIA GR00T-N1.5-3B](https://huggingface.co/nvidia/GR00T-N1.5-3B), [LeRobot Models](https://huggingface.co/lerobot)
- **Related Datasets:** [LeRobot Datasets](https://huggingface.co/lerobot/datasets)
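
The checkpoint can be fetched locally with `huggingface_hub` before wiring it into a LeRobot or GR00T inference stack. A minimal sketch follows; the repo id is a placeholder, since this card does not state where the weights are hosted.

```python
# Minimal sketch: download the checkpoint files with huggingface_hub.
# NOTE: "your-org/gemma-groot" is a hypothetical repo id; replace it with the actual one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/gemma-groot",  # placeholder, not confirmed by this card
    local_dir="./gemma-groot",       # destination for config and weight files
)
print(f"Checkpoint downloaded to: {local_dir}")
```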

### Model Architecture

The model is built on a sophisticated multimodal architecture that combines state-of-the-art vision and language models for robotic control:

1. **Backbone**: `Eagle2_5_VLForConditionalGeneration`
   - A powerful vision-language model that processes both visual and textual inputs
   - Integrates vision and language representations for multimodal understanding

2. **Text Encoder**: `Qwen3-1.7B`
   - Base Model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
   - Type: Causal Language Model
   - Parameters: 1.7B
   - Layers: 28
   - Attention: 16 heads for Q, 8 heads for KV (GQA)
   - Context Length: 32,768 tokens
   - Features:
     - Strong reasoning and instruction-following capabilities
     - Optimized for long-context understanding
     - Supports complex language understanding and generation
|
79 |
3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
|
80 |
+
- Base Model: [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224)
|
81 |
- Type: Vision Transformer (ViT)
|
82 |
+
- Patch Size: 16x16
|
83 |
+
- Image Size: 224x224
|
84 |
+
- Hidden Size: 768
|
85 |
+
- Layers: 12
|
86 |
+
- Attention Heads: 12
|
87 |
+
- Features:
|
88 |
+
- Strong visual representation learning
|
89 |
+
- Excellent zero-shot classification capabilities
|
90 |
+
- Robust to various visual domains
|
91 |
|
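
   For context, a 224x224 frame can be embedded with the SigLIP checkpoint referenced above. A minimal, self-contained sketch using `transformers` (the zero-filled image stands in for a camera frame):

   ```python
   # Minimal sketch: embed one 224x224 frame with the SigLIP vision tower.
   import numpy as np
   import torch
   from PIL import Image
   from transformers import AutoImageProcessor, SiglipVisionModel

   processor = AutoImageProcessor.from_pretrained("google/siglip-base-patch16-224")
   vision_tower = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")

   # Dummy RGB frame standing in for a robot camera observation.
   frame = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))

   inputs = processor(images=frame, return_tensors="pt")
   with torch.no_grad():
       outputs = vision_tower(**inputs)

   print(outputs.last_hidden_state.shape)  # (1, 196, 768): one token per 16x16 patch
   ```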

4. **Action Head**: Diffusion-based Policy
   - Type: Flow-matching action head
   - Architecture: 4-layer transformer (ScaledDP)
   - Hidden Size: 512
   - Feed-Forward Size: 2,048
   - Attention Heads: 8
   - Features:
     - Generates smooth, continuous actions for robotic control
     - Uses diffusion process for action generation
     - Processes both visual and language conditioning
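
To illustrate how a flow-matching head turns noise into an action chunk, here is a schematic, self-contained sketch: a toy velocity network integrated with Euler steps. It is not the actual GR00T head, and the dimensions and step count are illustrative assumptions.

```python
# Schematic sketch of flow-matching action generation (not the actual GR00T head):
# integrate a learned velocity field from Gaussian noise toward an action chunk.
import torch
import torch.nn as nn

ACTION_DIM, CHUNK = 32, 16   # illustrative sizes, not taken from the checkpoint
COND_DIM = 512               # assumed size of the fused vision-language embedding

class ToyVelocityNet(nn.Module):
    """Predicts d(actions)/dt given the noisy chunk, the time t, and the conditioning."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK * ACTION_DIM + COND_DIM + 1, 512),
            nn.GELU(),
            nn.Linear(512, CHUNK * ACTION_DIM),
        )

    def forward(self, x, t, cond):
        inp = torch.cat([x.flatten(1), cond, t], dim=-1)
        return self.net(inp).view_as(x)

@torch.no_grad()
def sample_actions(model, cond, steps=10):
    """Euler integration from noise (t=0) to an action chunk (t=1)."""
    x = torch.randn(cond.shape[0], CHUNK, ACTION_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        x = x + dt * model(x, t, cond)
    return x

cond = torch.randn(1, COND_DIM)  # stand-in for the backbone's fused embedding
print(sample_actions(ToyVelocityNet(), cond).shape)  # torch.Size([1, 16, 32])
```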

## Training & Evaluation

### Training Performance

- **Total Training Steps**: 30,000
- **Final Epoch**: 114.5
- **Initial Loss**: 1.27
- **Final Loss**: 0.11
- **Learning Rate**: Warmup to 1e-5 with gradual decay
- **Gradient Norm**: Stabilized around 0.3-1.0 (initial: 11.1)

### Recommended Evaluation Metrics

#### Task Performance
- **Success Rate**: Percentage of successful task completions
- **Path Length**: Efficiency of movement (shorter paths are better)
- **Smoothness**: L2 norm of action derivatives (lower is smoother)
- **Goal Distance**: Final distance to target position
- **Success Rate at k (SR@k)**: Success rate within k attempts

#### Model Accuracy
- **Action MSE**: Mean squared error of predicted vs. ground truth actions
- **Per-Joint Position Error**: Error for each degree of freedom
- **Gripper Accuracy**: Binary classification of gripper state
- **Trajectory Error**: Dynamic Time Warping (DTW) distance from reference
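
Several of these metrics reduce to a few lines over logged rollouts. A minimal NumPy sketch of Action MSE, smoothness, and SR@k (the array shapes are illustrative assumptions):

```python
# Minimal sketch of three of the metrics listed above, computed from logged rollouts.
import numpy as np

def action_mse(pred, gt):
    """Mean squared error between predicted and ground-truth actions, shape (T, action_dim)."""
    return float(np.mean((pred - gt) ** 2))

def smoothness(actions):
    """Mean L2 norm of per-step action differences (lower is smoother)."""
    return float(np.mean(np.linalg.norm(np.diff(actions, axis=0), axis=-1)))

def success_rate_at_k(attempts, k):
    """SR@k: fraction of tasks solved within the first k attempts.
    `attempts` is a list of per-task outcome lists, e.g. [[False, True], [True]]."""
    return float(np.mean([any(outcomes[:k]) for outcomes in attempts]))

pred = np.random.randn(100, 7)  # 100 steps of a 7-DoF arm (illustrative)
gt = np.random.randn(100, 7)
print(action_mse(pred, gt), smoothness(pred))
print(success_rate_at_k([[False, True], [True], [False, False]], k=2))
```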

#### System Efficiency
- **Inference Time**: Per-step latency (ms)
- **Memory Usage**: Peak GPU memory consumption (GB)
- **FLOPS**: Computational requirements
- **Throughput**: Steps/second during inference
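
Latency, throughput, and peak memory can be logged with standard PyTorch utilities. A minimal sketch, assuming `policy` is any callable that maps an observation batch to actions on a CUDA device:

```python
# Minimal sketch: per-step latency, throughput, and peak GPU memory for a policy callable.
import time
import torch

def profile_policy(policy, obs, n_steps=100, device="cuda"):
    torch.cuda.reset_peak_memory_stats(device)
    latencies_ms = []
    with torch.no_grad():
        for _ in range(n_steps):
            start = time.perf_counter()
            policy(obs)
            torch.cuda.synchronize(device)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    return {
        "latency_ms_mean": mean_ms,
        "throughput_steps_per_s": 1000.0 / mean_ms,
        "peak_memory_gb": torch.cuda.max_memory_allocated(device) / 1e9,
    }
```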

#### Robustness
- **Success Rate under Noise**: Performance with added sensor noise
- **Generalization**: Performance on unseen objects/scenes
- **Failure Mode Analysis**: Categorization of common failures
- **Recovery Rate**: Ability to recover from perturbations

### Evaluation Protocol

1. **Test Environments**
   - Fixed initial conditions
   - Multiple random seeds (recommended: 5+)
   - Human baseline comparison
   - Ablation studies

2. **Visualization**
   - Trajectory plots (ground truth vs predicted)
   - Attention heatmaps
   - Failure case analysis
   - Action distribution plots

3. **Reporting**
   - Mean and standard deviation across seeds
   - Statistical significance testing
   - Compute requirements (GPU hours, memory)
   - Hyperparameter sensitivity analysis

### Training Configuration

- Optimizer: AdamW (lr=1e-4, weight_decay=1e-6)
- Diffusion Steps: 100
- Chunk Size: 16
- Action Steps: 8
- Observation Steps: 1
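
These settings translate directly to PyTorch. A minimal sketch of the optimizer with a linear warmup followed by decay; the warmup length and the cosine decay shape are assumptions, since the card only states "warmup ... with gradual decay":

```python
# Minimal sketch of the training configuration above: AdamW with warmup then gradual decay.
import math
import torch

def build_optimizer(params, total_steps=30_000, warmup_steps=1_000):
    # lr and weight_decay follow the card; warmup_steps and the cosine shape are assumed.
    optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-6)

    def lr_lambda(step):
        if step < warmup_steps:  # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # gradual (cosine) decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```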

The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, then fuses these representations in the Eagle2.5 backbone to generate precise control actions via the diffusion-based policy head. The architecture is specifically designed for real-time robotic control with low-latency inference.

## Uses