---
license: apache-2.0
library_name: terratorch
datasets:
- ibm-esa-geospatial/TerraMesh
tags:
- Earth Observation
- TerraMind
- IBM
- ESA
---
[**Paper**](https://arxiv.org/abs/2504.11171)
| [**Examples**](https://github.com/IBM/terramind)
| [**Model Code**](https://github.com/IBM/terratorch/tree/main/terratorch/models/backbones/terramind)
| [**ESA Blog**](todo)
| [**IBM Blog**](todo)

# TerraMind 1.0 large

TerraMind is the first multimodal any-to-any generative foundation model for Earth Observation, jointly developed by IBM, ESA, and Forschungszentrum Jülich.

![terramind_architecture.png](assets%2Fterramind_architecture.png)

## Architecture

TerraMind uses a dual-scale transformer-based encoder-decoder architecture that processes pixel-level and token-level data simultaneously.
The model was pre-trained on 500B tokens from 9M spatiotemporally aligned multimodal samples from the TerraMesh dataset.

Modality-specific patch embeddings allow direct processing of raw inputs, while modality-specific FSQ-VAEs are used for image tokenization.
For sequence-like modalities such as coordinates, an adapted WordPiece tokenizer is employed.
During pre-training, TerraMind leverages masked token reconstruction and learns complex cross-modal correlations that yield high-quality latent representations.
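
As a toy illustration of the finite scalar quantization (FSQ) idea behind the image tokenizers (a generic sketch, not TerraMind's actual tokenizer code), each latent dimension is bounded and rounded to a small fixed set of levels, so every latent vector maps to a discrete code:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Toy FSQ: bound each latent dim with tanh, then round it to one of
    `levels` uniformly spaced values (straight-through gradient)."""
    half = (levels - 1) / 2
    z_bounded = torch.tanh(z) * half   # values in (-half, half)
    z_quant = torch.round(z_bounded)   # snap to the integer grid
    return z_bounded + (z_quant - z_bounded).detach()

z = torch.randn(4, 8)       # a batch of 8-dim latents
codes = fsq_quantize(z)     # each dim now takes one of 7 discrete values
```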

## Evaluation

![terramind_evaluation.png](assets%2Fterramind_evaluation.png)

We benchmarked TerraMind against other geospatial foundation models using the PANGAEA benchmark.
TerraMind consistently achieved state-of-the-art performance, surpassing existing models in downstream tasks such as land use segmentation, water body mapping, and vegetation assessment.
The evaluation highlights its effectiveness across diverse Earth Observation scenarios.
We present additional experiments in our [pre-print](https://arxiv.org/abs/2504.11171).

## Usage

TerraMind is fully integrated into the fine-tuning package [TerraTorch](https://ibm.github.io/terratorch/).
This makes it easy to initialize the pre-trained model or fine-tune it via PyTorch Lightning.
The weights are downloaded automatically from Hugging Face.

### Fine-tuning

You can fine-tune TerraMind with a config file using TerraTorch:

```shell
terratorch fit -c terramind_config.yaml
```

To test the fine-tuned model, run:

```shell
terratorch test -c terramind_config.yaml --ckpt_path path/to/your/checkpoint.ckpt
```

We provide config examples and notebooks with step-by-step explanations at https://github.com/IBM/terramind.
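
If you prefer a programmatic workflow over the CLI, the same fine-tuning can be sketched with PyTorch Lightning and TerraTorch's task classes. This is a minimal outline under assumptions: the task arguments, decoder choice, and `my_datamodule` are illustrative placeholders, and the configs in the repository above are the reference.

```python
import lightning.pytorch as pl
from terratorch.tasks import SemanticSegmentationTask

# Illustrative task setup; argument values are placeholders, not a recipe.
task = SemanticSegmentationTask(
    model_factory="EncoderDecoderFactory",
    model_args={
        "backbone": "terramind_v1_large",
        "backbone_pretrained": True,
        "backbone_modalities": ["S2L2A"],
        "decoder": "UNetDecoder",
        "decoder_channels": [512, 256, 128, 64],
        "num_classes": 2,
    },
    loss="ce",
    lr=1e-4,
)

trainer = pl.Trainer(max_epochs=50)
trainer.fit(task, datamodule=my_datamodule)  # my_datamodule: your LightningDataModule
```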

### Backbone

Alternatively, you can build the backbone with the following code and use it in your custom pipeline.

```python
from terratorch import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD']
)
```

The model supports the following raw inputs, which you can specify in `modalities`: S2L2A, S2L1C, S1GRD, S1RTC, DEM, and RGB.
If your data does not use all bands of a modality, you can specify a subset with `bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']}`, as in the sketch below.
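
For example, a short sketch (the band subset here is an illustrative choice) of building a single-modality backbone with fewer bands:

```python
# Build an S2L2A-only backbone that expects a 6-band subset.
model_subset = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A'],
    bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']}
)
```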

You can pass the inputs to the model as a dict. If a tensor is passed directly, the model assumes it belongs to the first defined modality.
TerraMind can also handle missing input modalities.

```python
import torch

# Dummy inputs with the expected shapes (B, C, H, W)
s2l2a_tensor = torch.randn(1, 12, 224, 224)  # 12 S2L2A bands
s1grd_tensor = torch.randn(1, 2, 224, 224)   # 2 S1GRD bands

output = model(
    {
        'S2L2A': s2l2a_tensor,
        'S1GRD': s1grd_tensor,
    }
)

output.shape  # B, 196, 1024
```

The model outputs patch embeddings for each input modality. By default, the patch embeddings are averaged over all modalities to reduce the output size.
You can specify another `merge_method` from `'mean'`, `'max'`, `'concat'`, `'dict'`, and `None`:
- `mean` and `max` are applied per patch over all image modality embeddings.
- `concat` stacks all image modalities along the embedding dimension and returns one embedding per patch.
- `dict` returns all tokens split by modality in a dictionary (see the sketch below).
- `None` returns the tokens without further processing.
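
For instance, a short sketch (assuming `merge_method` is passed at build time, like the other options above) of inspecting per-modality embeddings:

```python
# Return one embedding tensor per modality instead of averaging them.
model_dict = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    merge_method='dict'
)

output = model_dict({'S2L2A': s2l2a_tensor, 'S1GRD': s1grd_tensor})
output['S2L2A'].shape  # B, 196, 1024
output['S1GRD'].shape  # B, 196, 1024
```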

### Thinking in Modalities

TerraMind introduces a new Thinking-in-Modalities (TiM) approach, in which additional modalities are predicted as an intermediate step.
The fine-tuned encoder then uses both the raw inputs and the generated modalities.

Use TiM models in TerraTorch by adding `_tim` to the model name:

```python
from terratorch import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    'terramind_v1_large_tim',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    tim_modalities=['LULC']  # optional, defaults to LULC (land-use/land-cover)
)
```

If you use TiM models, we recommend standardizing inputs with the [pre-training statistics](https://github.com/IBM/terratorch/blob/a4ca8df7c7f22ddf469f372e1099157d2d7beeb2/terratorch/models/backbones/terramind/model/terramind_register.py#L111).
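
Standardization itself is a simple per-band transform. The statistics below are placeholders; copy the actual values for your modalities and bands from the linked `terramind_register.py`:

```python
import torch

# Placeholder per-band statistics -- NOT the real pre-training values.
s2l2a_mean = torch.full((1, 12, 1, 1), 1390.0)
s2l2a_std = torch.full((1, 12, 1, 1), 2106.0)

raw_s2l2a = torch.randn(1, 12, 224, 224)  # your raw S2L2A batch
s2l2a_tensor = (raw_s2l2a - s2l2a_mean) / s2l2a_std
```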

### Generations

TerraMind can perform any-to-any generation based on varying combinations of inputs.

![terramind_generations.png](assets%2Fterramind_generations.png)

Build the full TerraMind model (including the de-tokenizer steps) from the `FULL_MODEL_REGISTRY`:

```python
from terratorch import FULL_MODEL_REGISTRY

model = FULL_MODEL_REGISTRY.build(
    'terramind_v1_large_generate',
    pretrained=True,
    modalities=['S2L2A'],
    output_modalities=['S1GRD', 'LULC'],
    timesteps=10,  # number of diffusion steps
    standardize=True,  # apply the pre-training standardization
)
```

As with the backbone, pass multiple modalities as a dict or a single modality as a tensor; the model returns the generated `output_modalities` as a dict of tensors.
Note: these generations are not reconstructions but "mental images" representing how the model imagines the modality.
You can control the level of generation detail via the number of diffusion steps (`timesteps`), which you can pass to the constructor or to the forward function.
With `standardize=True`, the pre-training standardization values are automatically applied to the inputs and outputs.
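
A short usage sketch (the input tensor and the output shapes are illustrative assumptions):

```python
import torch

s2l2a = torch.randn(1, 12, 224, 224)  # dummy S2L2A input, shape B, C, H, W

with torch.no_grad():
    generated = model(s2l2a, timesteps=50)  # override the constructor's default steps

generated['S1GRD']  # generated S1 image, e.g. shape 1, 2, 224, 224
generated['LULC']   # generated land-use/land-cover map
```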

We provide an example notebook for generations at https://github.com/IBM/terramind.

## Feedback

Your feedback is invaluable to us.
Please share it by starting a discussion in this HF repository or by submitting an issue to [TerraMind](https://github.com/IBM/terramind) on GitHub.

## Citation

If you use TerraMind in your research, please cite the [TerraMind](https://arxiv.org/abs/2504.11171) pre-print.

```text
@article{jakubik2025terramind,
  title={TerraMind: Large-Scale Generative Multimodality for Earth Observation},
  author={Jakubik, Johannes and Yang, Felix and Blumenstiel, Benedikt and Scheurer, Erik and Sedona, Rocco and Maurogiovanni, Stefano and Bosmans, Jente and Dionelis, Nikolaos and Marsocci, Valerio and Kopp, Niklas and others},
  journal={arXiv preprint arXiv:2504.11171},
  year={2025}
}
```