EnhancedViTUNet for Front-to-BEV Prediction

This model takes a front-view RGB image and predicts a Bird’s-Eye View (BEV) image.

Architecture: Vision Transformer (ViT) encoder + U-Net style decoder
Training: Synthetic Gazebo11 simulation dataset, using an ROI-masked L1 loss plus a VGG perceptual loss
Input size: 384×384 RGB
Output size: 384×384 RGB BEV
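
The ROI-masked L1 term named under Training can be illustrated with a minimal NumPy sketch. It is not the model's actual training code. The function name `roi_masked_l1`, the binary mask layout (1 inside the region of interest, 0 outside), and the normalization by the number of masked values are all assumptions made for illustration. The VGG perceptual term is omitted because it depends on pretrained network weights.

```python
import numpy as np


def roi_masked_l1(pred, target, roi_mask, eps=1e-8):
    """Mean absolute error over pixels where roi_mask is nonzero.

    pred, target: float arrays of shape (H, W, 3)
    roi_mask:     array of shape (H, W), 1 inside the ROI and 0 outside
                  (hypothetical layout, assumed for this sketch)
    """
    # Add a channel axis so the mask broadcasts over the RGB channels.
    mask = roi_mask.astype(pred.dtype)[..., None]
    abs_err = np.abs(pred - target) * mask
    # Normalize by the number of values inside the ROI, not the full image.
    n_valid = mask.sum() * pred.shape[-1]
    return float(abs_err.sum() / (n_valid + eps))


# Toy example at the documented 384x384 resolution.
H = W = 384
pred = np.zeros((H, W, 3), dtype=np.float32)
target = np.ones((H, W, 3), dtype=np.float32)
target[:, : W // 2] = 0.0          # left half already matches the prediction

mask = np.zeros((H, W), dtype=np.float32)
mask[:, W // 2 :] = 1.0            # ROI covers only the right half

loss = roi_masked_l1(pred, target, mask)
unmasked = float(np.abs(pred - target).mean())
```

In this toy case the masked loss only sees the right half, where every pixel is wrong, so it is larger than the plain L1 mean over the whole image. Pixels outside the ROI contribute nothing, whatever their error.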