Add Gemma-GR00T model weights
README.md
CHANGED
### Model Architecture

The model combines state-of-the-art vision and language components in a multimodal architecture for robotic control:

1. **Backbone**: `Eagle2_5_VLForConditionalGeneration` - a vision-language model that processes both visual and textual inputs.

2. **Text Encoder**: `Qwen3-1.7B`
   - Type: Causal language model
   - Parameters: 1.7B
   - Layers: 28
   - Attention: 16 query heads, 8 key/value heads (grouped-query attention, GQA)
   - Context length: 32,768 tokens
   - Features: Strong reasoning and instruction-following capabilities

3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
   - Type: Vision Transformer (ViT)
   - Patch size: 14x14
   - Hidden size: 1,152
   - Layers: 27
   - Attention heads: 16
   - Features: Strong visual representation learning with localization capabilities

4. **Action Head**: Diffusion-based policy
   - Type: Flow-matching action head
   - Architecture: 16-layer transformer
   - Hidden size: 1,024
   - Attention heads: 32
   - Features: Generates smooth, continuous actions for robotic control

The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, fuses these representations in the Eagle2.5 backbone, and generates continuous control actions via the diffusion-based policy head.
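To make the text encoder's attention layout concrete: with 16 query heads and 8 key/value heads, each KV head is shared by 2 query heads. The sketch below implements that grouping in NumPy; the sequence length and head dimension are illustrative, not taken from the model config:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    """
    group = q.shape[0] // k.shape[0]          # 16 // 8 = 2 query heads per KV head
    # Repeat each KV head so it lines up with its query-head group.
    k = np.repeat(k, group, axis=0)           # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (n_q_heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # row-wise softmax
    return weights @ v                                    # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4, 64))   # 16 query heads
k = rng.standard_normal((8, 4, 64))    # 8 shared KV heads
v = rng.standard_normal((8, 4, 64))
out = gqa_attention(q, k, v)
print(out.shape)  # (16, 4, 64)
```

Sharing KV heads this way shrinks the KV cache by half relative to full multi-head attention while keeping the full set of query heads.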
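The vision encoder's token count follows directly from the 14x14 patch size. The card does not state the input resolution, so the resolutions below (224 and 448 are common for SigLIP variants) are assumptions:

```python
def siglip_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square input image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    return per_side * per_side

# Each patch token is then embedded into the encoder's 1,152-dim hidden space.
print(siglip_tokens(224))  # 16 * 16 = 256 patch tokens
print(siglip_tokens(448))  # 32 * 32 = 1024 patch tokens
```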
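The flow-matching action head generates actions by integrating a learned velocity field from Gaussian noise at t=0 to an action chunk at t=1. This is a generic Euler-integration sketch of that sampling loop, with a toy linear-path velocity field standing in for the real 16-layer transformer; action dimension, horizon, and step count are illustrative:

```python
import numpy as np

def sample_actions(velocity_fn, action_dim=7, horizon=16, steps=10, seed=0):
    """Integrate a flow-matching velocity field from noise (t=0) to actions (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)              # Euler step along the flow
    return x

# Toy velocity field: the conditional velocity for a straight-line path
# toward a fixed target trajectory (here, the zero trajectory).
target = np.zeros((16, 7))
velocity = lambda x, t: (target - x) / max(1.0 - t, 1e-3)

actions = sample_actions(velocity)
print(actions.shape)  # (16, 7)
```

In the real head, `velocity_fn` is the transformer conditioned on the fused vision-language context, and the integrated trajectory is the chunk of continuous control actions.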
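The end-to-end flow described above can be sketched with stand-in tensors; every shape and module here is an illustrative placeholder (not the real Gemma-GR00T API), chosen only to show how the token streams meet:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1024  # fused hidden size (illustrative; matches the action head width)

# Stand-ins for the encoder outputs, already projected to a common width D.
vision_tokens = rng.standard_normal((256, D))   # e.g. 16x16 SigLIP patch tokens
text_tokens = rng.standard_normal((32, D))      # instruction tokens from Qwen3

# Backbone "fusion": the VLM attends over the concatenated token streams.
fused = np.concatenate([vision_tokens, text_tokens], axis=0)   # (288, D)

# Action head stand-in: pool the fused context and map it to one action vector.
W = rng.standard_normal((D, 7)) * 0.01
action = fused.mean(axis=0) @ W                 # (7,)
print(fused.shape, action.shape)  # (288, 1024) (7,)
```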
## Uses