
Physics Foundation Vision Transformer (PhysicsViT-StandardVersion)

A Vision Transformer model trained on multi-physics simulation data for scientific computing applications. This model is specifically designed for understanding and analyzing physics simulations across multiple domains.

Model Version: Standard Version - Trained for ~1.2 epochs (78,372 steps)

Model Details

Model Description

  • Developed by: PhysicsAlchemists Research Team
  • Model type: Vision Transformer (ViT-Huge)
  • Language(s): N/A (Computer Vision)
  • License: Apache 2.0
  • Finetuned from model: None (trained from scratch on physics simulation data)
  • Training Steps: 78,372 steps

Model Architecture

  • Architecture: ViT-Huge (Feature Extraction)
  • Hidden size: 1280
  • Number of layers: 32
  • Number of attention heads: 16
  • Intermediate size: 5120
  • Image size: 224×224
  • Patch size: 16×16
  • Embedding dimension: 1280
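
For reference, these hyperparameters map onto the Hugging Face ViTConfig as follows (a sketch, assuming the standard transformers ViT implementation):

from transformers import ViTConfig, ViTModel

# Matching configuration for the architecture listed above
config = ViTConfig(
    hidden_size=1280,
    num_hidden_layers=32,
    num_attention_heads=16,
    intermediate_size=5120,
    image_size=224,
    patch_size=16,
)
model = ViTModel(config)  # randomly initialized; use from_pretrained to load trained weights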

Training Details

Training Data

The model was trained on a comprehensive dataset of physics simulations including:

  • Acoustic scattering (inclusions, discontinuous, maze)
  • Active matter simulations
  • Euler equations (multi-quadrants with open/periodic BC)
  • Gray-Scott reaction-diffusion
  • Helmholtz staircase
  • Planetary shallow water equations
  • Rayleigh-Bénard convection (standard and uniform)
  • Shear flow dynamics
  • Turbulent radiative layer (2D)
  • Viscoelastic instability

Training Configuration

  • Training regime: ~1.2 epochs (78,372 steps)
  • Batch size: 1,470
  • Learning rate: 0.0005 (with warmup and cosine decay)
  • Optimizer: Adam (β₁=0.9, β₂=0.999, weight_decay=0.0003)
  • Mixed precision: bfloat16
  • Hardware: Cerebras CS-X systems
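
For illustration, the optimizer and schedule above could be approximated in PyTorch as follows (a sketch; the warmup length is not stated on this card, so num_warmup_steps is a placeholder):

import torch
from transformers import get_cosine_schedule_with_warmup

# Mirrors the listed optimizer settings
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.999),
    weight_decay=3e-4,
)
# Warmup length is an assumption; the card does not state it
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,   # placeholder
    num_training_steps=78_372,
)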

Data Augmentation

  • Random colormap application (viridis, plasma, inferno, coolwarm)
  • Grayscale conversion (30% probability)
  • Temporal trajectory preservation during training
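
The exact augmentation code is not published on this card, but a minimal sketch of the colormap/grayscale step might look like this (the use of matplotlib colormaps is an assumption):

import random
import numpy as np
import matplotlib
from PIL import Image

COLORMAPS = ["viridis", "plasma", "inferno", "coolwarm"]

def augment(pil_img):
    # Grayscale conversion with 30% probability
    if random.random() < 0.3:
        return pil_img.convert("L").convert("RGB")
    # Otherwise map the scalar field through a random colormap
    field = np.asarray(pil_img.convert("L"), dtype=np.float32) / 255.0
    cmap = matplotlib.colormaps[random.choice(COLORMAPS)]
    rgba = cmap(field)  # (H, W, 4) array of floats in [0, 1]
    return Image.fromarray((rgba[..., :3] * 255).astype(np.uint8))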

Usage

⚠️ Important: This model requires specific preprocessing that differs from standard ViT models.

Basic Usage

from transformers import AutoModel, AutoImageProcessor
from torchvision import transforms
from PIL import Image
import torch

# Load model and processor
# (note: the processor is not used below; this model requires the
# custom preprocessing shown instead of the processor defaults)
model = AutoModel.from_pretrained("your-username/physics-vit-standard")
processor = AutoImageProcessor.from_pretrained("your-username/physics-vit-standard")

# Load your physics image
image = Image.open("physics_simulation.png").convert('RGB')

# ⚠️ CRITICAL: Apply custom preprocessing
# (expand_to_square is defined in the next section)
image = expand_to_square(image, background_color=(128, 128, 128))
image = image.resize((224, 224), Image.BILINEAR)

# Convert to tensor and add batch dimension
tensor = transforms.ToTensor()(image).unsqueeze(0)

# Extract physics-aware embeddings
with torch.no_grad():
    outputs = model(pixel_values=tensor)

    # CLS token embedding (best for classification tasks)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # Shape: [1, 1280]

    # Average pooled embedding (good for trajectory prediction)
    pooled_embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

    # Patch embeddings (for spatial analysis)
    patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # Shape: [1, 196, 1280]

print(f"CLS embedding shape: {cls_embedding.shape}")

Required Preprocessing Function

from PIL import Image

def expand_to_square(pil_img, background_color):
    """
    Pad image to square with background color, keeping image centered.
    
    REQUIRED for Physics ViT - this preprocessing was used during training.
    """
    background_color = tuple(background_color)
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
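
For example, a 300×200 landscape image is padded vertically to 300×300 and stays horizontally centered:

img = Image.new("RGB", (300, 200))
padded = expand_to_square(img, background_color=(128, 128, 128))
print(padded.size)  # (300, 300)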

Downstream Tasks

This model produces rich 1280-dimensional embeddings optimized for:

  • Physics Domain Classification: Use CLS token embeddings
  • Temporal Forecasting: Use pooled embeddings for trajectory prediction
  • Clustering & Similarity: Use CLS or pooled embeddings
  • Spatial Analysis: Use patch embeddings
  • Transfer Learning: Fine-tune embeddings for new physics domains
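
As a sketch of the clustering and similarity use case, reusing model, expand_to_square, and transforms from the usage code above (the embed helper and file names are hypothetical):

import torch
import torch.nn.functional as F

# Hypothetical helper: wraps the preprocessing + forward pass shown earlier
def embed(path):
    image = Image.open(path).convert("RGB")
    image = expand_to_square(image, background_color=(128, 128, 128))
    image = image.resize((224, 224), Image.BILINEAR)
    tensor = transforms.ToTensor()(image).unsqueeze(0)
    with torch.no_grad():
        return model(pixel_values=tensor).last_hidden_state[:, 0, :]

# Cosine similarity between two frames (file names are placeholders)
sim = F.cosine_similarity(embed("frame_a.png"), embed("frame_b.png"))
print(f"similarity: {sim.item():.3f}")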

Performance

The model has been evaluated against DINO v2 and CLIP on physics-specific tasks:

  • Classification: Superior performance on physics domain classification
  • Temporal Forecasting: Better prediction of physics evolution
  • Clustering: Clearer separation of physics simulation types
  • Transfer Learning: Robust features for new physics applications

Detailed benchmarks are available in the original research.

Model Versions

  • Standard Version: 78,372 training steps (~1.2 epochs) - Good balance of performance and training efficiency
  • Extended Version: 195,930 training steps (3 full epochs) - Maximum performance, longer training

Installation

pip install transformers torch torchvision pillow

Limitations

  • Domain Specific: Optimized for physics simulations, may not generalize to natural images
  • Preprocessing Required: Must use expand_to_square preprocessing for correct results
  • Resolution: Optimized for 224×224 input images
  • Physics Domains: Trained on specific simulation types listed above

Citation

@misc{physics-vit-2024,
  title={Physics Foundation Vision Transformer for Scientific Computing},
  author={PhysicsAlchemists Research Team},
  year={2024},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/your-username/physics-vit-standard}
}

Acknowledgments

  • Built using Cerebras ModelZoo
  • Trained on Cerebras CS-X systems
  • Based on Vision Transformer architecture