---
license: apache-2.0
---

# QuixiGR00T-N1.5-3B-Zero

by Eric Hartford

I love GR00T, but NVIDIA's license - tsk-tsk, no no no, that won't do at all. Also, all their inference code is wrapped in hard-coded CUDA dependencies. Rude.

The world - our future, and our children's future - deserves a high-quality, permissively licensed robot control model that isn't tied to any specific hardware.

This repo contains a fully open-source, Apache 2.0 licensed, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control.

This model has the exact same architecture as NVIDIA's GR00T-N1.5-3B, but with random weights.

And NO, it's NOT gonna be uncensored! It's driving a humanoid robot, you guys! I am not trying to burn down the world here! (You can easily finetune it to do ANYTHING you want it to.)

I created this model using [this script](init_DolphinGR00T_zero.py).

The purpose is to distill GR00T into an Apache-2.0 licensed version. The whole job looks like this:

1) Make an Apache 2.0 licensed "blank slate" with the right shape (this repo).
2) Track down the sub-components that are Apache 2.0 and bring those weights in (qwen3-1.7b, for instance, is used as the language tower).
3) For missing components, find some initialization that's better than "random" - like merging from similar models into the correct shape.
4) Distill GR00T onto it with online logit distillation (a minimal sketch of this step is shown below). The model's small, so it's easy to load both models into VRAM!
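For step 4, the core of online logit distillation is just a KL term between the teacher's and the student's softened logits, computed on the fly while both models sit in memory. Below is a minimal sketch of that step; it is not the actual distillation script, the `get_logits` callable is a placeholder for however each model exposes its language-model logits for a batch, and the real pipeline would also need to cover the vision and action components.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def distill_step(student, teacher, batch, get_logits, optimizer, temperature=2.0):
    """One online distillation step.

    `get_logits(model, batch)` is a placeholder for however each model
    exposes its language-model logits for a batch of inputs.
    """
    teacher.eval()
    with torch.no_grad():
        teacher_logits = get_logits(teacher, batch)

    student_logits = get_logits(student, batch)
    loss = distillation_loss(student_logits, teacher_logits, temperature)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```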
## Model Description

DolphinGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:

- **Architecture**: Dual-system design with a vision-language backbone (Eagle-based with a Qwen3 LLM) and a diffusion transformer action head
- **Parameters**: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
- **License**: Apache-2.0 (fully open source)
- **Weights**: Randomly initialized - no pre-training, ready for your own training

## Key Features

- ✅ **Exact architecture match** with NVIDIA GR00T-N1.5-3B
- ✅ **No license restrictions** - Apache-2.0 throughout
- ✅ **Mixed precision ready** - bfloat16 backbone, float32 action head
- ✅ **Multi-modal inputs** - images, language instructions, and robot proprioception
- ✅ **Continuous action output** via diffusion transformer

## Installation

```bash
pip install torch transformers safetensors
```

## Usage

### Loading the Model

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "DolphinGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DolphinGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```

### Inference Example

```python
import torch
import torch.nn.functional as F
from PIL import Image
import numpy as np


def prepare_image(image_path, target_size=(224, 224)):
    """Prepare an image for model input."""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image


def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions.

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()
    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through backbone
        # Note: This is a simplified example - the actual implementation depends on the model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - actual fusion may be more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),   # Pool vision features
            language_features.mean(dim=1)  # Pool language features
        ], dim=-1)

        # Generate actions through the diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to action space
        actions = model.action_head.action_decoder(action_features)

    return actions


# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
```
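Note that the single `model.action_head.model(...)` call above collapses the diffusion process into one step. A trained diffusion-transformer action head is normally sampled iteratively, starting from noise and integrating toward a clean action trajectory over several timesteps. The sketch below shows the general shape of such a loop (a simple Euler-style, flow-matching-flavored sampler); `action_denoiser`, `action_horizon`, and `action_dim` are invented names, and the real action head's sampling interface may differ.

```python
import torch


@torch.no_grad()
def sample_actions(action_denoiser, combined_features, robot_state, device,
                   num_steps=10, action_horizon=16, action_dim=32):
    """Illustrative iterative denoising loop (not the model's actual sampler).

    `action_denoiser`, `action_horizon`, and `action_dim` are placeholders for
    however the trained action head is actually exposed and shaped.
    """
    dt = 1.0 / num_steps
    # Start from pure Gaussian noise over an action trajectory
    actions = torch.randn(1, action_horizon, action_dim, device=device)

    for step in range(num_steps):
        t = torch.full((1,), step * dt, device=device)
        # The denoiser predicts a velocity given the current noisy trajectory,
        # the fused vision-language features, the robot state, and the timestep
        velocity = action_denoiser(actions, t, combined_features, robot_state)
        # Euler integration step toward the clean trajectory
        actions = actions + dt * velocity

    return actions
```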
### Training Example

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup


class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks."""

    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }


def train_step(model, batch, criterion, device):
    """Single training step."""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - the actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)

    return loss


def train_model(model, train_dataset, val_dataset, config):
    """Main training loop."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Setup optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()

            loss = train_step(model, batch, criterion, device)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)
        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")


# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create datasets (you need to implement data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train model
# train_model(model, train_dataset, val_dataset, config)
```

### Fine-tuning Tips

1. **Mixed Precision Training**: The model is designed for mixed precision. Use `torch.cuda.amp` for faster training:

```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

with autocast():
    loss = train_step(model, batch, criterion, device)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

2. **Gradient Checkpointing**: For memory-efficient training:

```python
model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
```

3. **Frozen Backbone Training**: Start by training only the action head:

```python
# Freeze backbone
for param in model.backbone.parameters():
    param.requires_grad = False

# Train only the action head
optimizer = torch.optim.AdamW(
    model.action_head.parameters(),
    lr=1e-4
)
```

## Model Architecture

The model consists of two main components:

### 1. Vision-Language Backbone (System 2)

- **Vision Encoder**: Based on the Eagle vision model with 27 transformer layers
- **Language Model**: Qwen3-based LLM with 12 layers and a 2048 hidden dimension
- **Cross-modal Fusion**: MLP connector between vision and language

### 2. Action Head (System 1)

- **Diffusion Transformer**: 16 DiT blocks for action generation
- **State Encoder**: Processes robot proprioception
- **Action Decoder**: Outputs continuous robot actions
- **Self-Attention Blocks**: 4 transformer blocks for vision-language features
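As a sanity check that a loaded checkpoint matches the breakdown above (roughly 1,655M bfloat16 backbone parameters and 1,069M float32 action-head parameters), per-component counts and dtypes can be tallied with plain PyTorch. The snippet below assumes the top-level submodules are exposed as `model.backbone` and `model.action_head`, as in the usage examples.

```python
from collections import Counter


def summarize(module, name):
    """Print total parameter count and dtype mix for one component."""
    n_params = sum(p.numel() for p in module.parameters())
    dtypes = Counter(str(p.dtype) for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters, dtypes: {dict(dtypes)}")


summarize(model.backbone, "backbone")        # expected ~1,655M, mostly bfloat16
summarize(model.action_head, "action_head")  # expected ~1,069M, float32
```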
## Limitations

- This is a **blank model** with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots, but can be adapted to other embodiments
- Requires significant computational resources for training

## Citation

If you use this model in your research, please cite:

```bibtex
@software{DolphinGR00T2024,
  title={DolphinGR00T-N1.5-3B-Zero: a Permissively Licensed Reimplementation of GR00T-N1.5-3B},
  author={Eric Hartford},
  year={2024},
  license={Apache-2.0}
}
```

## License

Apache-2.0 - This model is fully open source with no restrictions.

## Acknowledgments

This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but contains no proprietary code or weights.