---
license: apache-2.0
datasets:
- Major-TOM/Core-S2L2A
- Major-TOM/Core-S2L1C
- Major-TOM/Core-S1RTC
tags:
- Earth Observation
- Foundation Model
- Remote Sensing
---

# TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation


[![paper](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2506.06281)
[![code](https://img.shields.io/badge/GitHub-Code-blue.svg)](https://github.com/mbzuai-oryx/TerraFM)
[![Model Zoo](https://img.shields.io/badge/Model%20Zoo-HuggingFace-blue)](#🧠-model-zoo)

---

## 📢 Latest Updates

- **Jun-09-25**: 🚀 Initial release of **TerraFM codebase** and **pretrained models**
- **Jun-09-25**: 📄 Paper released on arXiv: [arxiv link](https://arxiv.org/abs/2506.06281). 🔥🔥

---

## 🌍 Overview

**TerraFM** is a scalable foundation model designed for unified processing of multisensor Earth Observation (EO) data. Built on a ViT backbone and trained over **18.7M tiles (~23T pixels)** from Sentinel-1 SAR and Sentinel-2 optical imagery, TerraFM unifies modality-specific inputs using:

- 🧩 Modality-specific patch embeddings
- 🌀 Adaptive cross-attention fusion
- 🎯 Dual-centering regularization for long-tailed distributions

TerraFM sets a new benchmark on **GEO-Bench** and **Copernicus-Bench**, demonstrating strong generalization across geographies, modalities, and tasks, including classification, segmentation, and landslide detection.

---

## 🔬 Key Features


- **Multimodal Pretraining**: Uses Sentinel-1 (SAR) and Sentinel-2 (L1C, L2A) as natural augmentations.
- **Large-Scale Dataset**: Trained on 18.7M global tiles from the [Major-TOM](https://huggingface.co/Major-TOM) dataset.
- **Cross-Attention Fusion**: Dynamically aggregates information across sensors at patch level.
- **Dual-Centering**: Mitigates long-tailed land cover bias using ESA WorldCover statistics.
- **Benchmark SOTA**: Outperforms prior FMs (Galileo, Prithvi, DOFA) across multiple EO tasks.

---

## 🧱 Architecture

[Figure: TerraFM Architecture]

Overall architecture of TerraFM. It combines a student-teacher contrastive framework with modality augmentation, cross-attention fusion, and a new dual-centering regularization. TerraFM is built on a ViT backbone, is pre-trained on 18.7M globally distributed samples, and uses large-tile inputs to encode broader spatial context. For illustration, RGB channels from S2-L2A and S2-L1C are selected, and S1 is visualized using a false-color RGB composite.

---

## 🧠 Model Zoo

| Model | Modality | Input Size | Backbone | Link |
|-------|----------|------------|----------|------|
| TerraFM-B | Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C | 224×224 | ViT-Base | [Download](https://huggingface.co/MBZUAI/TerraFM) |
| TerraFM-L | Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C | 224×224 | ViT-Large | [Download](https://huggingface.co/MBZUAI/TerraFM) |

---

## 🛠 Usage

TerraFM can be used directly via the `terrafm.py` module, which provides standalone implementations of the TerraFM-Base and TerraFM-Large models for easy integration into any codebase.

```python
import torch

from terrafm import terrafm_base, terrafm_large  # terrafm_large builds the ViT-Large variant

# Simulated input: 1 sample, 12 channels, 224×224 resolution (e.g., Sentinel-2 L2A)
x = torch.randn(1, 12, 224, 224)

# Load TerraFM-Base model
model = terrafm_base()

# Load pretrained weights (e.g., TerraFM-B.pth)
state_dict = torch.load("TerraFM-B.pth", map_location="cpu")
# strict=False tolerates key mismatches; msg lists any missing or unexpected keys
msg = model.load_state_dict(state_dict, strict=False)
print(msg)

# Forward pass in eval mode, without tracking gradients
model.eval()
with torch.no_grad():
    y = model(x)
print(f"Output shape: {y.shape}")
```

---

## 📊 Results

### 🔍 k-NN Classification Results

We evaluate image classification using k-nearest neighbors (k-NN) and report Top-1 accuracy for all single-label tasks. For the multilabel BigEarthNet benchmark, we report the F1 score.
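The k-NN protocol described above can be illustrated with a small, dependency-free sketch. It assumes cosine similarity on L2-normalized features and a plain majority vote with `k=3`; the README does not state the value of `k`, the distance metric, or the voting scheme, so these are assumptions, and the toy 2-D vectors are hypothetical stand-ins for frozen encoder embeddings.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def knn_predict(train_feats, train_labels, query, k=3):
    """Predict a label by majority vote among the k most cosine-similar training features."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    # Sort by similarity, highest first, and keep the k nearest neighbours
    top_k = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]


def top1_accuracy(train_feats, train_labels, test_feats, test_labels, k=3):
    """Percentage of test samples whose k-NN prediction matches the true label."""
    correct = sum(
        knn_predict(train_feats, train_labels, feat, k) == label
        for feat, label in zip(test_feats, test_labels)
    )
    return 100.0 * correct / len(test_labels)


# Toy 2-D "embeddings" standing in for frozen encoder features (hypothetical data)
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_labels = ["forest", "forest", "forest", "water", "water", "water"]
test_feats = [[0.95, 0.05], [0.05, 0.95]]
test_labels = ["forest", "water"]

accuracy = top1_accuracy(train_feats, train_labels, test_feats, test_labels, k=3)
print(f"Top-1 accuracy: {accuracy:.1f}%")
```

In practice the feature lists would come from a forward pass of the pretrained encoder over the training and test splits, with the encoder weights kept frozen.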

| Model | Backbone | m-EuroSat (100%) | m-EuroSat (1%) | m-BigEarthNet (100%) | m-BigEarthNet (1%) | m-So2Sat (100%) | m-So2Sat (1%) | m-Brick-Kiln (100%) | m-Brick-Kiln (1%) |
|-------|----------|------------------|----------------|----------------------|--------------------|-----------------|---------------|---------------------|-------------------|
| SatMAE | ViT-Base | 84.1 | 34.8 | 50.6 | 29.0 | 36.0 | 23.1 | 86.1 | 73.5 |
| SatMAE++ | ViT-Large | 82.7 | 48.5 | 50.8 | 31.6 | 34.7 | 23.4 | 89.6 | 76.7 |
| CROMA | ViT-Base | 85.6 | 51.3 | 58.8 | 44.7 | 48.8 | 33.8 | 92.6 | 85.1 |
| SoftCon | ViT-Small | 89.8 | 27.2 | 64.7 | 43.3 | 51.1 | 31.4 | 89.2 | 77.8 |
| DOFA | ViT-Base | 82.8 | 49.6 | 49.4 | 29.9 | 41.4 | 29.4 | 88.3 | 78.3 |
| Satlas | Swin-Tiny | 81.7 | 35.8 | 51.9 | 29.6 | 36.6 | 27.1 | 88.2 | 73.0 |
| MMEarth | CNN-atto | 81.7 | 30.0 | 58.3 | 39.6 | 39.8 | 25.1 | 89.4 | 79.7 |
| DeCUR | ViT-Small | 89.0 | 46.6 | 63.8 | 49.6 | 45.8 | 30.9 | 83.7 | 74.2 |
| AnySat | ViT-Base | 82.2 | 47.1 | 54.9 | 33.7 | 39.8 | 29.0 | 85.3 | 72.0 |
| Galileo | ViT-Base | 93.0 | 56.6 | 59.0 | 36.5 | 54.8 | **43.2** | 90.7 | 78.0 |
| Prithvi-2.0 | ViT-Large | 80.2 | 48.0 | 49.4 | 28.8 | 29.5 | 26.1 | 87.9 | 80.6 |
| Copernicus-FM | ViT-Base | 76.0 | 47.4 | 53.8 | 33.3 | 38.4 | 23.3 | 93.0 | 83.2 |
| **TerraFM** | ViT-Base | _94.2_ | _59.3_ | _68.7_ | 49.4 | _55.1_ | _41.6_ | **94.5** | **85.6** |
| **TerraFM** | ViT-Large | **95.1** | **62.1** | **69.4** | **50.6** | **55.9** | 41.1 | _93.0_ | 82.2 |

### 🛰 Copernicus-Bench

Comparison of TerraFM with existing supervised and self-supervised methods on **Copernicus-Bench**. Metrics include **OA** (Overall Accuracy), **mAP** (mean Average Precision), and **mIoU** (mean Intersection over Union).

| Dataset | Metric | Supervised | Random | SoftCon | CROMA | DOFA | Copernicus-FM | **TerraFM** |
|---------|--------|------------|--------|---------|-------|------|---------------|-------------|
| **Backbone** | -- | ViT-B/16 | ViT-B/16 | ViT-B/14 | ViT-B/8 | ViT-B/16 | ViT-B/16 | ViT-B/16 |
| **Cloud-S2** | mIoU | 59.4 | 60.4 | 66.9 | 65.0 | 65.0 | 66.7 | **67.9** |
| **EuroSAT-S1** | OA | 81.5 | 75.4 | 83.6 | 83.9 | 81.7 | 87.2 | **87.8** |
| **EuroSAT-S2** | OA | 97.6 | 92.5 | 96.7 | 97.0 | 97.2 | 97.9 | **99.1** |
| **BigEarthNet-S1** | mAP | 70.6 | 63.8 | **78.7** | 70.8 | 70.5 | 77.9 | 76.9 |
| **BigEarthNet-S2** | mAP | 80.1 | 71.6 | 83.6 | 76.4 | 75.5 | 79.0 | **84.4** |
| **DFC2020-S1** | mIoU | 50.8 | 45.4 | 52.8 | 52.7 | 49.7 | 52.4 | **55.4** |
| **DFC2020-S2** | mIoU | 66.2 | 62.3 | 64.1 | **66.5** | 61.8 | 64.5 | 63.8 |
| **LCZ-S2** | OA | 85.3 | 77.4 | 83.6 | 84.1 | 83.0 | 84.4 | **87.0** |

### 🧪 GEO-Bench Performance

Performance comparison on GEO-Bench, reported as Top-1 accuracy for **classification**, mIoU for **segmentation**, and **F1 score** for the multilabel m-BigEarthNet task. TerraFM achieves state-of-the-art results across multiple datasets, outperforming previous foundation models.

| Method | Backbone | m-EuroSat | m-BigEarthNet | m-So2Sat | m-Brick-Kiln | m-Cashew-Plant | m-SA-Crop-Type |
|--------|----------|-----------|---------------|----------|--------------|----------------|----------------|
| SatMAE | ViT-Large | 96.6 | 68.3 | 57.2 | 98.4 | 30.8 | 24.8 |
| SatMAE++ | ViT-Large | 96.5 | 67.9 | 56.0 | 98.6 | 29.6 | 25.7 |
| CROMA | ViT-Large | 96.6 | 71.9 | 60.6 | 98.7 | 31.8 | 32.0 |
| SoftCon | ViT-Base | 97.5 | 70.3 | 61.7 | 98.7 | 29.6 | 30.8 |
| DOFA | ViT-Large | 96.9 | 68.0 | 58.7 | 98.6 | 27.7 | 25.4 |
| Satlas | Swin-Base | 97.5 | 72.8 | 61.9 | **98.9** | 25.1 | 23.4 |
| MMEarth | CNN-atto | 95.7 | 70.0 | 57.2 | 98.9 | 24.2 | 22.2 |
| DeCUR | ViT-Small | 97.9 | 70.9 | 61.7 | 98.7 | 26.2 | 21.5 |
| Prithvi 2.0 | ViT-Large | 96.5 | 69.0 | 54.6 | 98.6 | 26.7 | 22.9 |
| AnySat | ViT-Base | 95.9 | 70.3 | 51.8 | 98.6 | 26.1 | 27.1 |
| Galileo | ViT-Base | 97.7 | 70.7 | 63.3 | 98.7 | 33.0 | 30.1 |
| **TerraFM** | ViT-Base | *98.1* | 72.6 | *64.9* | 98.7 | *34.1* | *33.0* |
| **TerraFM** | ViT-Large | **98.6** | **73.1** | **66.6** | **99.0** | **37.2** | **34.5** |

### 🌋 Landslide Detection (Landslide4Sense)

Landslide detection performance on the **Landslide4Sense** test set. Despite having significantly fewer parameters (120M vs. 300M), **TerraFM** achieves higher overall segmentation performance, especially for landslide regions.

| Model | mIoU | IoU (Landslide) |
|-------|------|-----------------|
| Prithvi-EO-2.0 (300M) | 65.0 | 31.5 |
| **TerraFM (120M)** | **70.8** | **43.1** |

[Figure: Landslide Detection]
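For readers less familiar with the segmentation metrics in the tables above, the following is a minimal sketch of how per-class IoU, mIoU, and overall accuracy can be derived from a confusion matrix. It assumes mIoU is the unweighted mean of per-class IoU over classes that actually occur, which is the common convention but is not spelled out in this README; the label arrays are made-up toy data, and the function names are hypothetical rather than part of the TerraFM codebase.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels per (true class, predicted class) pair; rows are ground truth."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou(cm):
    """IoU per class = TP / (TP + FP + FN); None for classes absent from both maps."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted class c, true otherwise
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(ious):
    """Unweighted mean over the classes that have a defined IoU."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


def overall_accuracy(cm):
    """Fraction of pixels whose predicted class matches the ground truth."""
    total = sum(sum(row) for row in cm)
    return sum(cm[c][c] for c in range(len(cm))) / total


# Flattened toy label maps: 0 = background, 1 = landslide (hypothetical data)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=2)
ious = per_class_iou(cm)
miou = mean_iou(ious)
oa = overall_accuracy(cm)

print(f"IoU (landslide): {ious[1]:.3f}")
print(f"mIoU: {miou:.3f}")
print(f"OA: {oa:.3f}")
```

Because the landslide class is typically rare, its IoU can sit well below the mIoU, which is why the table reports both numbers.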

---

## 📜 Citation

If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:

```bibtex
@article{danish2025terrafmscalablefoundationmodel,
  title={TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation},
  author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Muhammad Haris Khan and Rao Muhammad Anwer and Jorma Laaksonen and Fahad Shahbaz Khan and Salman Khan},
  year={2025},
  eprint={2506.06281},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.06281},
}
```

## 📨 Contact

If you have any questions, please create an issue on this repository or contact muhammad.sohail@mbzuai.ac.ae.