Commit ed6cb09 (verified) by Ryukijano · 1 Parent(s): 17c92b9

Add Gemma-GR00T model weights

Files changed (1)
  1. README.md +28 -8
README.md CHANGED
@@ -45,14 +45,34 @@ Gemma-GR00T is a state-of-the-art multimodal vision-language-action policy that
 
  ### Model Architecture
 
- - **Backbone:** Gemma-based vision-language model
- - **Action Head:** Diffusion-based policy with cross-attention
- - **Vision Encoder:** SigLIP-400M
- - **Action Space:** 32-dimensional continuous actions
- - **Horizon:** 16 timesteps
- - **Diffusion Steps:** 4 (inference)
- - **Hidden Size:** 1024
- - **Attention Heads:** 32
+ The model is built on a sophisticated multimodal architecture that combines state-of-the-art vision and language models for robotic control:
+
+ 1. **Backbone**: `Eagle2_5_VLForConditionalGeneration` - A powerful vision-language model that processes both visual and textual inputs.
+
+ 2. **Text Encoder**: `Qwen3-1.7B`
+ - Type: Causal Language Model
+ - Parameters: 1.7B
+ - Layers: 28
+ - Attention: 16 heads for Q, 8 heads for KV (GQA)
+ - Context Length: 32,768 tokens
+ - Features: Strong reasoning and instruction-following capabilities
+
+ 3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
+ - Type: Vision Transformer (ViT)
+ - Patch Size: 14x14
+ - Hidden Size: 1,152
+ - Layers: 27
+ - Attention Heads: 16
+ - Features: Strong visual representation learning with localization capabilities
+
+ 4. **Action Head**: Diffusion-based Policy
+ - Type: Flow-matching action head
+ - Architecture: 16-layer transformer
+ - Hidden Size: 1,024
+ - Attention Heads: 32
+ - Features: Generates smooth, continuous actions for robotic control
+
+ The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, then fuses these representations in the Eagle2.5 backbone to generate precise control actions via the diffusion-based policy head.
 
  ## Uses
 
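
Two figures in the added README text may be easier to follow with a small illustration: the grouped-query attention layout ("16 heads for Q, 8 heads for KV") and the flow-matching action head. The Python sketch below is not taken from the model's code. `kv_head_for_query_head`, `euler_flow_sample`, and `make_toy_velocity` are hypothetical names, the toy velocity field stands in for the learned 16-layer transformer, and the default of 4 integration steps is borrowed from the removed "Diffusion Steps: 4 (inference)" bullet on the assumption that it still applies.

```python
def kv_head_for_query_head(q_head, num_q_heads=16, num_kv_heads=8):
    """Return the KV head that a query head shares under grouped-query attention."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    if not 0 <= q_head < num_q_heads:
        raise IndexError("query head index out of range")
    group_size = num_q_heads // num_kv_heads  # 2 query heads per KV head here
    return q_head // group_size


def euler_flow_sample(velocity_fn, noise, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1 inside the loop
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the learned action head.

    Returns the exact velocity of a straight-line path from the current
    point to `target`, so the sampler's behaviour is easy to check by hand.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    # Which KV head each of the 16 query heads reads from.
    print([kv_head_for_query_head(h) for h in range(16)])

    # Move a 3-value toy "action" from a zero start toward a fixed target.
    target = [0.5, -1.0, 2.0]
    print(euler_flow_sample(make_toy_velocity(target), [0.0, 0.0, 0.0]))
```

In a real policy the starting point would be Gaussian noise shaped like the action chunk, and the velocity would come from the action head conditioned on the backbone's vision-language features. The toy field only shows how a few Euler steps carry a sample from its starting point to an action.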