Physics Foundation Vision Transformer (PhysicsViT-StandardVersion)

A Vision Transformer trained on multi-physics simulation data for scientific computing applications. The model is designed to understand and analyze physics simulations across multiple domains.

Model Version: Standard Version - Trained for 78,372 steps

Model Details

Model Description

  • Developed by: PhysicsAlchemists Research Team
  • Model type: Vision Transformer (ViT-Huge)
  • License: MIT License
  • Finetuned from model: None (trained from scratch on physics simulation data)
  • Training Steps: 78,372 steps

Model Architecture

  • Architecture: ViT-Huge (Feature Extraction)
  • Hidden size: 1280
  • Number of layers: 32
  • Number of attention heads: 16
  • Intermediate size: 5120
  • Image size: 224×224
  • Patch size: 16×16
  • Embedding dimension: 1280

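As a quick sanity check, these hyperparameters can be read back from the published configuration. A minimal sketch, assuming the repository exposes a standard transformers ViT-style config:

from transformers import AutoConfig

# Read the architecture hyperparameters from the Hub (assumes a standard
# ViT-style config; attribute names follow transformers' ViTConfig)
config = AutoConfig.from_pretrained("JessicaE/physics-vit-standard")
print(config.hidden_size)            # expected: 1280
print(config.num_hidden_layers)      # expected: 32
print(config.num_attention_heads)    # expected: 16
print(config.intermediate_size)      # expected: 5120
print(config.image_size, config.patch_size)  # expected: 224 16
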
Training Details

Training Data

The model was trained on a comprehensive dataset of physics simulations including:

  • Acoustic scattering (inclusions, discontinuous, maze)
  • Active matter simulations
  • Euler equations (multi-quadrants with open/periodic BC)
  • Gray-Scott reaction-diffusion
  • Helmholtz staircase
  • Planetary shallow water equations
  • Rayleigh-Bénard convection (standard and uniform)
  • Shear flow dynamics
  • Turbulent radiative layer (2D)
  • Viscoelastic instability

Training Configuration

  • Training regime: 78,372 steps
  • Batch size: 1,470
  • Learning rate: 0.0005 (with warmup and cosine decay)
  • Optimizer: Adam (β₁=0.9, β₂=0.999, weight_decay=0.0003)
  • Mixed precision: bfloat16
  • Hardware: Cerebras CS-X systems

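For readers reproducing a comparable setup, a linear-warmup-then-cosine-decay schedule over the 78,372-step horizon can be written in PyTorch as below. This is an illustrative sketch only; the warmup length and the placeholder module are assumptions, not published training code.

import torch

# Placeholder module; only the peak LR, betas, weight decay, and total step
# count come from the card. warmup_steps is an assumption for illustration.
model = torch.nn.Linear(1280, 1280)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), weight_decay=3e-4)

total_steps = 78_372
warmup_steps = 1_000  # assumed

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Call scheduler.step() once per training step, after optimizer.step().
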
Data Augmentation

  • Random colormap application (viridis, plasma, inferno, coolwarm)
  • Grayscale conversion (30% probability)
  • Temporal trajectory preservation during training

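The colormap and grayscale augmentations can be approximated with matplotlib (an extra dependency not listed under Installation). The sketch below illustrates the idea and is not the exact training pipeline.

import random
import numpy as np
from matplotlib import colormaps
from PIL import Image

COLORMAPS = ["viridis", "plasma", "inferno", "coolwarm"]

def augment_physics_frame(pil_img):
    """Illustrative augmentation: grayscale with 30% probability,
    otherwise a randomly chosen colormap applied to the scalar field."""
    gray = np.asarray(pil_img.convert("L"), dtype=np.float32) / 255.0
    if random.random() < 0.3:
        return Image.fromarray((gray * 255).astype(np.uint8)).convert("RGB")
    cmap = colormaps[random.choice(COLORMAPS)]
    rgba = cmap(gray)                       # H x W x 4 floats in [0, 1]
    rgb = (rgba[..., :3] * 255).astype(np.uint8)
    return Image.fromarray(rgb)
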
Usage

⚠️ Important: This model requires specific preprocessing that differs from standard ViT models.

Basic Usage

from transformers import AutoModel, AutoImageProcessor
from torchvision import transforms
from PIL import Image
import torch

# Load model (the image processor is loaded for completeness, but the custom
# preprocessing below is applied instead of the standard ViT pipeline)
model = AutoModel.from_pretrained("JessicaE/physics-vit-standard")
processor = AutoImageProcessor.from_pretrained("JessicaE/physics-vit-standard")

# Load your physics image
image = Image.open("physics_simulation.png").convert('RGB')

# Apply custom preprocessing (expand_to_square is defined below under
# "Required Preprocessing Function")
image = expand_to_square(image, background_color=(128, 128, 128))
image = image.resize((224, 224), Image.BILINEAR)

# Convert to tensor and add batch dimension
tensor = transforms.ToTensor()(image).unsqueeze(0)

# Extract physics-aware embeddings
with torch.no_grad():
    outputs = model(pixel_values=tensor)
    
    # CLS token embedding (best for classification tasks)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # Shape: [1, 1280]
    
    # Average pooled embedding (good for trajectory prediction)  
    pooled_embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
    
    # Patch embeddings (for spatial analysis)
    patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # Shape: [1, 196, 1280]

print(f"CLS embedding shape: {cls_embedding.shape}")

Required Preprocessing Function

from PIL import Image

def expand_to_square(pil_img, background_color):
    """
    Pad image to square with background color, keeping image centered.
    
    REQUIRED for Physics ViT - this preprocessing was used during training.
    """
    background_color = tuple(background_color)
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result

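Putting the two snippets together, a small wrapper keeps the required preprocessing and the embedding extraction in one place. The helper name embed_physics_image is illustrative, not part of the model's API.

from PIL import Image
from torchvision import transforms
import torch

def embed_physics_image(path, model):
    """Load an image, apply the required preprocessing, and return the
    [1, 1280] CLS embedding. Relies on expand_to_square defined above."""
    image = Image.open(path).convert("RGB")
    image = expand_to_square(image, background_color=(128, 128, 128))
    image = image.resize((224, 224), Image.BILINEAR)
    tensor = transforms.ToTensor()(image).unsqueeze(0)
    with torch.no_grad():
        outputs = model(pixel_values=tensor)
    return outputs.last_hidden_state[:, 0, :]
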
Downstream Tasks

This model produces rich 1280-dimensional embeddings optimized for:

  • Physics Domain Classification: Use CLS token embeddings
  • Temporal Forecasting: Use pooled embeddings for trajectory prediction
  • Clustering & Similarity: Use CLS or pooled embeddings
  • Spatial Analysis: Use patch embeddings
  • Transfer Learning: Fine-tune embeddings for new physics domains

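For instance, CLS (or pooled) embeddings can be compared with cosine similarity to group frames from the same physics domain. A minimal sketch, assuming embeddings were extracted as in the usage example above:

import torch
import torch.nn.functional as F

def embedding_similarity(emb_a, emb_b):
    """Cosine similarity between two [1, 1280] embeddings; higher values
    suggest the frames come from the same physics domain or trajectory."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1).item()
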
Performance

The model has been evaluated against DINOv2 and CLIP on physics-specific tasks:

  • Classification: Superior performance on physics domain classification
  • Temporal Forecasting: Better prediction of physics evolution
  • Clustering: Clearer separation of physics simulation types
  • Transfer Learning: Robust features for new physics applications

Detailed benchmarks are available in the accompanying research poster (see Citation below).

Model Versions

  • Standard Version: 78,372 training steps - Good balance of performance and training efficiency
  • Extended Version: 195,930 training steps - Maximum performance, longer training

Installation

pip install transformers torch torchvision pillow

Limitations

  • Domain Specific: Optimized for physics simulations, may not generalize to natural images
  • Preprocessing Required: Must use expand_to_square preprocessing for correct results
  • Resolution: Optimized for 224×224 input images
  • Physics Domains: Trained on specific simulation types listed above

Citation

@misc{physics-vit-2025,
  title={PhySiViT: A Physics Simulation Vision Transformer},
  author={Ezemba, Jessica and Afful, James and Wang, Mei-Yu},
  year={2025},
  howpublished={SC 2025 Research Poster},
  url={https://huggingface.co/JessicaE/physics-vit-standard}
}

Acknowledgments

  • Built using Cerebras ModelZoo
  • Trained on Cerebras CS-X systems and Bridges-2 GPUs (Pittsburgh Supercomputing Center)
  • Based on Vision Transformer architecture
  • This work was made possible by the ByteBoost cybertraining program, funded by National Science Foundation Cybertraining awards 2320990, 2320991, and 2320992, as well as by the Neocortex project, the ACES platform, and the Ookami cluster.
  • The Neocortex project is supported by National Science Foundation award number 2005597.
  • The ACES (Accelerating Computing for Emerging Sciences) platform was funded by National Science Foundation award number 2112356.
  • The Ookami cluster is supported by National Science Foundation award number 1927880.