Add Gemma-GR00T model weights
README.md
CHANGED
### Model Architecture

The model combines state-of-the-art vision and language components in a multimodal architecture for robotic control:

1. **Backbone**: `Eagle2_5_VLForConditionalGeneration` - a vision-language model that processes both visual and textual inputs.

2. **Text Encoder**: `Qwen3-1.7B`
   - Type: Causal language model
   - Parameters: 1.7B
   - Layers: 28
   - Attention: 16 query heads, 8 key/value heads (grouped-query attention, GQA)
   - Context length: 32,768 tokens
   - Features: Strong reasoning and instruction-following capabilities

3. **Vision Encoder**: `SigLIP` (Sigmoid Loss for Language-Image Pre-training)
   - Type: Vision Transformer (ViT)
   - Patch size: 14x14
   - Hidden size: 1,152
   - Layers: 27
   - Attention heads: 16
   - Features: Strong visual representation learning with localization capabilities

4. **Action Head**: Diffusion-based policy
   - Type: Flow-matching action head
   - Architecture: 16-layer transformer
   - Hidden size: 1,024
   - Attention heads: 32
   - Features: Generates smooth, continuous actions for robotic control

The model processes visual inputs through the SigLIP vision encoder and textual instructions through the Qwen3-1.7B language model, fuses these representations in the Eagle2.5 backbone, and generates continuous control actions via the diffusion-based policy head.
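To make the text encoder's attention layout concrete: with 16 query heads and 8 key/value heads, each KV head is shared by 2 query heads. The sketch below implements that grouping in NumPy; the sequence length and head dimension are illustrative, not taken from the model config:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    """
    group = q.shape[0] // k.shape[0]          # 16 // 8 = 2 query heads per KV head
    # Repeat each KV head so it lines up with its query-head group.
    k = np.repeat(k, group, axis=0)           # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (n_q_heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # row-wise softmax
    return weights @ v                                    # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4, 64))   # 16 query heads
k = rng.standard_normal((8, 4, 64))    # 8 shared KV heads
v = rng.standard_normal((8, 4, 64))
out = gqa_attention(q, k, v)
print(out.shape)  # (16, 4, 64)
```

Sharing KV heads this way shrinks the KV cache by half relative to full multi-head attention while keeping the full set of query heads.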
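The vision encoder's token count follows directly from the 14x14 patch size. The card does not state the input resolution, so the resolutions below (224 and 448 are common for SigLIP variants) are assumptions:

```python
def siglip_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square input image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    return per_side * per_side

# Each patch token is then embedded into the encoder's 1,152-dim hidden space.
print(siglip_tokens(224))  # 16 * 16 = 256 patch tokens
print(siglip_tokens(448))  # 32 * 32 = 1024 patch tokens
```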
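The flow-matching action head generates actions by integrating a learned velocity field from Gaussian noise at t=0 to an action chunk at t=1. This is a generic Euler-integration sketch of that sampling loop, with a toy linear-path velocity field standing in for the real 16-layer transformer; action dimension, horizon, and step count are illustrative:

```python
import numpy as np

def sample_actions(velocity_fn, action_dim=7, horizon=16, steps=10, seed=0):
    """Integrate a flow-matching velocity field from noise (t=0) to actions (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)              # Euler step along the flow
    return x

# Toy velocity field: the conditional velocity for a straight-line path
# toward a fixed target trajectory (here, the zero trajectory).
target = np.zeros((16, 7))
velocity = lambda x, t: (target - x) / max(1.0 - t, 1e-3)

actions = sample_actions(velocity)
print(actions.shape)  # (16, 7)
```

In the real head, `velocity_fn` is the transformer conditioned on the fused vision-language context, and the integrated trajectory is the chunk of continuous control actions.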
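The end-to-end flow described above can be sketched with stand-in tensors; every shape and module here is an illustrative placeholder (not the real Gemma-GR00T API), chosen only to show how the token streams meet:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1024  # fused hidden size (illustrative; matches the action head width)

# Stand-ins for the encoder outputs, already projected to a common width D.
vision_tokens = rng.standard_normal((256, D))   # e.g. 16x16 SigLIP patch tokens
text_tokens = rng.standard_normal((32, D))      # instruction tokens from Qwen3

# Backbone "fusion": the VLM attends over the concatenated token streams.
fused = np.concatenate([vision_tokens, text_tokens], axis=0)   # (288, D)

# Action head stand-in: pool the fused context and map it to one action vector.
W = rng.standard_normal((D, 7)) * 0.01
action = fused.mean(axis=0) @ W                 # (7,)
print(fused.shape, action.shape)  # (288, 1024) (7,)
```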
## Uses