# Physics Foundation Vision Transformer (PhysicsViT-StandardVersion)

A Vision Transformer model trained on multi-physics simulation data for scientific computing applications. This model is specifically designed for understanding and analyzing physics simulations across multiple domains.

**Model Version:** Standard Version, trained for 78,372 steps
## Model Details

### Model Description

- **Developed by:** PhysicsAlchemists Research Team
- **Model type:** Vision Transformer (ViT-Huge)
- **License:** MIT License
- **Finetuned from model:** Not applicable; trained from scratch on physics simulation data
- **Training steps:** 78,372
### Model Architecture

- **Architecture:** ViT-Huge (feature extraction)
- **Hidden size (embedding dimension):** 1280
- **Number of layers:** 32
- **Number of attention heads:** 16
- **Intermediate size:** 5120
- **Image size:** 224×224
- **Patch size:** 16×16
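For reference, these hyperparameters map onto a `transformers` `ViTConfig` roughly as sketched below. This is illustrative and for shape-checking only; the hosted checkpoint's own `config.json` is authoritative.

```python
from transformers import ViTConfig, ViTModel

# Illustrative reconstruction of the architecture above
# (assumption: the checkpoint's config.json is the source of truth).
config = ViTConfig(
    hidden_size=1280,
    num_hidden_layers=32,
    num_attention_heads=16,
    intermediate_size=5120,
    image_size=224,
    patch_size=16,
)
vit = ViTModel(config)  # randomly initialized; useful only for inspecting shapes
```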
## Training Details

### Training Data

The model was trained on a comprehensive dataset of physics simulations including:
- Acoustic scattering (inclusions, discontinuous, maze)
- Active matter simulations
- Euler equations (multi-quadrants with open/periodic BC)
- Gray-Scott reaction-diffusion
- Helmholtz staircase
- Planetary shallow water equations
- Rayleigh-Bénard convection (standard and uniform)
- Shear flow dynamics
- Turbulent radiative layer (2D)
- Viscoelastic instability
### Training Configuration

- **Training regime:** 78,372 steps
- **Batch size:** 1,470
- **Learning rate:** 0.0005 (with warmup and cosine decay)
- **Optimizer:** Adam (β₁=0.9, β₂=0.999, weight_decay=0.0003)
- **Mixed precision:** bfloat16
- **Hardware:** Cerebras CS-X systems
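For readers who want to mirror this setup in plain PyTorch, the optimizer and schedule translate roughly to the sketch below. The warmup length is an assumption (the card does not state it), and `get_cosine_schedule_with_warmup` from `transformers` stands in for whatever scheduler the original Cerebras ModelZoo pipeline used.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# model: any torch.nn.Module, e.g. the ViTModel sketched above
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,                 # 0.0005, as listed
    betas=(0.9, 0.999),
    weight_decay=3e-4,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,  # assumption: warmup length not given in the card
    num_training_steps=78_372,
)
```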
### Data Augmentation
- Random colormap application (viridis, plasma, inferno, coolwarm)
- Grayscale conversion (30% probability)
- Temporal trajectory preservation during training
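A minimal sketch of what the colormap/grayscale augmentation could look like, assuming single-channel 2D simulation frames as NumPy arrays; the exact training pipeline may differ.

```python
import random
import numpy as np
import matplotlib.pyplot as plt

COLORMAPS = ["viridis", "plasma", "inferno", "coolwarm"]

def augment_frame(field: np.ndarray, grayscale_prob: float = 0.3) -> np.ndarray:
    """Render a 2D scalar field as an RGB image with a random colormap,
    or as grayscale with probability `grayscale_prob`."""
    lo, hi = field.min(), field.max()
    norm = (field - lo) / (hi - lo + 1e-8)          # normalize to [0, 1]
    if random.random() < grayscale_prob:
        return np.stack([norm] * 3, axis=-1)        # replicate to 3 channels
    cmap = plt.get_cmap(random.choice(COLORMAPS))
    return cmap(norm)[..., :3]                      # drop the alpha channel
```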
## Usage

⚠️ **Important:** This model requires specific preprocessing that differs from standard ViT models.

### Basic Usage
```python
from transformers import AutoModel, AutoImageProcessor
from torchvision import transforms
from PIL import Image
import torch

# Load model and processor (the processor is shown for completeness; the
# custom square-padding preprocessing below is applied manually instead)
model = AutoModel.from_pretrained("JessicaE/physics-vit-standard")
processor = AutoImageProcessor.from_pretrained("JessicaE/physics-vit-standard")

# Load your physics image
image = Image.open("physics_simulation.png").convert("RGB")

# Apply custom preprocessing (expand_to_square is defined below)
image = expand_to_square(image, background_color=(128, 128, 128))
image = image.resize((224, 224), Image.BILINEAR)

# Convert to tensor and add batch dimension
tensor = transforms.ToTensor()(image).unsqueeze(0)

# Extract physics-aware embeddings
with torch.no_grad():
    outputs = model(pixel_values=tensor)

# CLS token embedding (best for classification tasks)
cls_embedding = outputs.last_hidden_state[:, 0, :]        # Shape: [1, 1280]

# Average pooled embedding (good for trajectory prediction)
pooled_embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

# Patch embeddings (for spatial analysis)
patch_embeddings = outputs.last_hidden_state[:, 1:, :]    # Shape: [1, 196, 1280]

print(f"CLS embedding shape: {cls_embedding.shape}")
```
### Required Preprocessing Function

```python
from PIL import Image

def expand_to_square(pil_img, background_color):
    """
    Pad image to square with background color, keeping image centered.
    REQUIRED for Physics ViT - this preprocessing was used during training.
    """
    background_color = tuple(background_color)
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
```
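For example, a 300×200 frame is padded to 300×300, with the image centered vertically on the gray background:

```python
img = Image.new("RGB", (300, 200), (255, 0, 0))  # wide dummy frame
square = expand_to_square(img, (128, 128, 128))
print(square.size)  # (300, 300)
```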
## Downstream Tasks

This model produces rich 1280-dimensional embeddings optimized for:
- Physics Domain Classification: Use CLS token embeddings
- Temporal Forecasting: Use pooled embeddings for trajectory prediction
- Clustering & Similarity: Use CLS or pooled embeddings
- Spatial Analysis: Use patch embeddings
- Transfer Learning: Fine-tune embeddings for new physics domains
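As an illustration of the clustering and similarity use case, the sketch below (continuing from the Basic Usage snippet, so `model` and `torch` are already in scope) compares two frames via cosine similarity of their CLS embeddings. `tensor_a` and `tensor_b` are hypothetical inputs produced by the preprocessing shown earlier.

```python
import torch.nn.functional as F

def embed(tensor):
    """Return the [1, 1280] CLS embedding for a preprocessed [1, 3, 224, 224] tensor."""
    with torch.no_grad():
        out = model(pixel_values=tensor)
    return out.last_hidden_state[:, 0, :]

# Higher cosine similarity suggests the frames come from similar physics regimes
similarity = F.cosine_similarity(embed(tensor_a), embed(tensor_b))
print(similarity.item())
```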
## Performance

The model has been evaluated against DINOv2 and CLIP on physics-specific tasks:

- **Classification:** Superior performance on physics domain classification
- **Temporal Forecasting:** Better prediction of physics evolution
- **Clustering:** Clearer separation of physics simulation types
- **Transfer Learning:** Robust features for new physics applications

Detailed benchmarks are available in the original research.
## Model Versions

- **Standard Version:** 78,372 training steps - good balance of performance and training efficiency
- **Extended Version:** 195,930 training steps - maximum performance, longer training
## Installation

```bash
pip install transformers torch torchvision pillow
```
## Limitations

- **Domain Specific:** Optimized for physics simulations; may not generalize to natural images
- **Preprocessing Required:** Must use the `expand_to_square` preprocessing for correct results
- **Resolution:** Optimized for 224×224 input images
- **Physics Domains:** Trained on the specific simulation types listed above
## Citation

```bibtex
@misc{physics-vit-2025,
  title={PhySiViT: A Physics Simulation Vision Transformer},
  author={Ezemba, Jessica and Afful, James and Wang, Mei-Yu},
  year={2025},
  howpublished={SC 2025 Research Poster},
  url={https://huggingface.co/JessicaE/physics-vit-standard}
}
```
## Acknowledgments

- Built using Cerebras ModelZoo
- Trained on Cerebras CS-X systems and Bridges-2 GPUs (Pittsburgh Supercomputing Center)
- Based on the Vision Transformer architecture
- This work was made possible by the ByteBoost cybertraining program, funded by National Science Foundation Cybertraining awards 2320990, 2320991, and 2320992, and by the Neocortex project, the ACES platform, and the Ookami cluster.
- The Neocortex project is supported by National Science Foundation award number 2005597.
- The ACES (Accelerating Computing for Emerging Sciences) platform was funded by National Science Foundation award number 2112356.
- The Ookami cluster is supported by National Science Foundation award number 1927880.