---
license: apache-2.0
---

# QuixiGR00T-N1.5-3B-Zero

by Eric Hartford

I love GR00T, but NVIDIA's license - tsk-tsk, no no no, that won't do at all. Also, all their inference code is wrapped in hard-coded CUDA dependencies. Rude.

The world - our future, and our children's future - deserves a high-quality, permissively licensed robot control model that isn't tied to any specific hardware.

This repo contains a fully open-source, Apache 2.0 licensed, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control.

This model has the exact same architecture as NVIDIA's GR00T-N1.5-3B, but with random weights.

And NO, it's NOT gonna be uncensored! It's driving a humanoid robot, you guys! I am not trying to burn down the world here! (You can easily finetune it to do ANYTHING you want it to.)

I created this model using [this script](init_DolphinGR00T_zero.py).

The purpose is to distill GR00T into an Apache-2.0 licensed version. The whole job looks like this:

1) Make an Apache 2.0 licensed "blank slate" with the right shape (this repo).
2) Track down the sub-components that are Apache 2.0 and bring those weights in (qwen3-1.7b, for instance, is used as the language tower).
3) For missing components, find some initialization that's better than "random" - like merging from similar models into the correct shape.
4) Distill GR00T onto it with online logit distillation (a minimal sketch of this step is shown below). The model's small, so it's easy to load both models into VRAM!
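For step 4, the core of online logit distillation is just a KL term between the teacher's and the student's softened logits, computed on the fly while both models sit in memory. Below is a minimal sketch of that step; it is not the actual distillation script, the `get_logits` callable is a placeholder for however each model exposes its language-model logits for a batch, and the real pipeline would also need to cover the vision and action components.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def distill_step(student, teacher, batch, get_logits, optimizer, temperature=2.0):
    """One online distillation step.

    `get_logits(model, batch)` is a placeholder for however each model
    exposes its language-model logits for a batch of inputs.
    """
    teacher.eval()
    with torch.no_grad():
        teacher_logits = get_logits(teacher, batch)

    student_logits = get_logits(student, batch)
    loss = distillation_loss(student_logits, teacher_logits, temperature)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```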
## Model Description

DolphinGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:

- **Architecture**: Dual-system design with a vision-language backbone (Eagle-based with a Qwen3 LLM) and a diffusion transformer action head
- **Parameters**: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
- **License**: Apache-2.0 (fully open source)
- **Weights**: Randomly initialized - no pre-training, ready for your own training

## Key Features

- ✅ **Exact architecture match** with NVIDIA GR00T-N1.5-3B
- ✅ **No license restrictions** - Apache-2.0 throughout
- ✅ **Mixed precision ready** - bfloat16 backbone, float32 action head
- ✅ **Multi-modal inputs** - images, language instructions, and robot proprioception
- ✅ **Continuous action output** via diffusion transformer

## Installation

```bash
pip install torch transformers safetensors
```

## Usage

### Loading the Model

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "DolphinGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DolphinGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```

### Inference Example

```python
import torch
import torch.nn.functional as F
from PIL import Image
import numpy as np


def prepare_image(image_path, target_size=(224, 224)):
    """Prepare an image for model input."""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image


def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions.

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()
    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through backbone
        # Note: This is a simplified example - the actual implementation depends on the model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - actual fusion may be more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),   # Pool vision features
            language_features.mean(dim=1)  # Pool language features
        ], dim=-1)

        # Generate actions through the diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to action space
        actions = model.action_head.action_decoder(action_features)

    return actions


# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
```
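Note that the single `model.action_head.model(...)` call above collapses the diffusion process into one step. A trained diffusion-transformer action head is normally sampled iteratively, starting from noise and integrating toward a clean action trajectory over several timesteps. The sketch below shows the general shape of such a loop (a simple Euler-style, flow-matching-flavored sampler); `action_denoiser`, `action_horizon`, and `action_dim` are invented names, and the real action head's sampling interface may differ.

```python
import torch


@torch.no_grad()
def sample_actions(action_denoiser, combined_features, robot_state, device,
                   num_steps=10, action_horizon=16, action_dim=32):
    """Illustrative iterative denoising loop (not the model's actual sampler).

    `action_denoiser`, `action_horizon`, and `action_dim` are placeholders for
    however the trained action head is actually exposed and shaped.
    """
    dt = 1.0 / num_steps
    # Start from pure Gaussian noise over an action trajectory
    actions = torch.randn(1, action_horizon, action_dim, device=device)

    for step in range(num_steps):
        t = torch.full((1,), step * dt, device=device)
        # The denoiser predicts a velocity given the current noisy trajectory,
        # the fused vision-language features, the robot state, and the timestep
        velocity = action_denoiser(actions, t, combined_features, robot_state)
        # Euler integration step toward the clean trajectory
        actions = actions + dt * velocity

    return actions
```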
### Training Example

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup


class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks."""

    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }


def train_step(model, batch, criterion, device):
    """Single training step."""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - the actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)

    return loss


def train_model(model, train_dataset, val_dataset, config):
    """Main training loop."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Setup optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()

            loss = train_step(model, batch, criterion, device)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)
        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")


# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create datasets (you need to implement data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train model
# train_model(model, train_dataset, val_dataset, config)
```

### Fine-tuning Tips

1. **Mixed Precision Training**: The model is designed for mixed precision. Use `torch.cuda.amp` for faster training:

```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

with autocast():
    loss = train_step(model, batch, criterion, device)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

2. **Gradient Checkpointing**: For memory-efficient training:

```python
model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
```

3. **Frozen Backbone Training**: Start by training only the action head:

```python
# Freeze backbone
for param in model.backbone.parameters():
    param.requires_grad = False

# Train only the action head
optimizer = torch.optim.AdamW(
    model.action_head.parameters(),
    lr=1e-4
)
```

## Model Architecture

The model consists of two main components:

### 1. Vision-Language Backbone (System 2)

- **Vision Encoder**: Based on the Eagle vision model with 27 transformer layers
- **Language Model**: Qwen3-based LLM with 12 layers and a 2048 hidden dimension
- **Cross-modal Fusion**: MLP connector between vision and language

### 2. Action Head (System 1)

- **Diffusion Transformer**: 16 DiT blocks for action generation
- **State Encoder**: Processes robot proprioception
- **Action Decoder**: Outputs continuous robot actions
- **Self-Attention Blocks**: 4 transformer blocks for vision-language features
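As a sanity check that a loaded checkpoint matches the breakdown above (roughly 1,655M bfloat16 backbone parameters and 1,069M float32 action-head parameters), per-component counts and dtypes can be tallied with plain PyTorch. The snippet below assumes the top-level submodules are exposed as `model.backbone` and `model.action_head`, as in the usage examples.

```python
from collections import Counter


def summarize(module, name):
    """Print total parameter count and dtype mix for one component."""
    n_params = sum(p.numel() for p in module.parameters())
    dtypes = Counter(str(p.dtype) for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters, dtypes: {dict(dtypes)}")


summarize(model.backbone, "backbone")        # expected ~1,655M, mostly bfloat16
summarize(model.action_head, "action_head")  # expected ~1,069M, float32
```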
## Limitations

- This is a **blank model** with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots, but can be adapted to other embodiments
- Requires significant computational resources for training

## Citation

If you use this model in your research, please cite:

```bibtex
@software{DolphinGR00T2024,
  title={DolphinGR00T-N1.5-3B-Zero: a Permissively Licensed Reimplementation of GR00T-N1.5-3B},
  author={Eric Hartford},
  year={2024},
  license={Apache-2.0}
}
```

## License

Apache-2.0 - This model is fully open source with no restrictions.

## Acknowledgments

This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but contains no proprietary code or weights.