---
license: apache-2.0
library_name: terratorch
datasets:
- ibm-esa-geospatial/TerraMesh
tags:
- Earth Observation
- TerraMind
- IBM
- ESA
---
[**Paper**](https://arxiv.org/abs/2504.11171)
| [**Examples**](https://github.com/IBM/terramind)
| [**Model Code**](https://github.com/IBM/terratorch/tree/main/terratorch/models/backbones/terramind)
| [**ESA Blog**](todo)
| [**IBM Blog**](todo)

# TerraMind 1.0 large

TerraMind is the first multimodal any-to-any generative foundation model for Earth Observation, jointly developed by IBM, ESA, and Forschungszentrum Jülich.

![terramind_architecture.png](assets%2Fterramind_architecture.png)

## Architecture

TerraMind uses a dual-scale transformer-based encoder-decoder architecture that processes pixel-level and token-level data simultaneously.
The model was pre-trained on 500B tokens from 9M spatiotemporally aligned multimodal samples from the TerraMesh dataset.

Modality-specific patch embeddings allow direct processing of raw inputs, while modality-specific FSQ-VAEs are used for image tokenization.
For sequence-like modalities such as coordinates, an adapted WordPiece tokenizer is employed.
During pre-training, TerraMind leverages masked token reconstruction and learns complex cross-modal correlations that yield high-quality latent representations.
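
As a toy illustration of the finite scalar quantization (FSQ) idea behind the image tokenizers (a generic sketch, not TerraMind's actual tokenizer code), each latent dimension is bounded and rounded to a small fixed set of levels, so every latent vector maps to a discrete code:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Toy FSQ: bound each latent dim with tanh, then round it to one of
    `levels` uniformly spaced values (straight-through gradient)."""
    half = (levels - 1) / 2
    z_bounded = torch.tanh(z) * half   # values in (-half, half)
    z_quant = torch.round(z_bounded)   # snap to the integer grid
    return z_bounded + (z_quant - z_bounded).detach()

z = torch.randn(4, 8)       # a batch of 8-dim latents
codes = fsq_quantize(z)     # each dim now takes one of 7 discrete values
```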

## Evaluation

![terramind_evaluation.png](assets%2Fterramind_evaluation.png)

We benchmarked TerraMind against other geospatial foundation models using the PANGAEA benchmark.
TerraMind consistently achieved state-of-the-art performance, surpassing existing models in downstream tasks such as land use segmentation, water body mapping, and vegetation assessment.
The evaluation highlights its effectiveness across diverse Earth Observation scenarios.
We present additional experiments in our [pre-print](https://arxiv.org/abs/2504.11171).

## Usage

TerraMind is fully integrated into the fine-tuning package [TerraTorch](https://ibm.github.io/terratorch/).
This makes it easy to initialize the pre-trained model or fine-tune it via PyTorch Lightning.
The weights are downloaded automatically from Hugging Face.

### Fine-tuning

You can fine-tune TerraMind with a config file using TerraTorch:

```shell
terratorch fit -c terramind_config.yaml
```

To test the fine-tuned model, run:

```shell
terratorch test -c terramind_config.yaml --ckpt_path path/to/your/checkpoint.ckpt
```

We provide config examples and notebooks with step-by-step explanations at https://github.com/IBM/terramind.
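
If you prefer a programmatic workflow over the CLI, the same fine-tuning can be sketched with PyTorch Lightning and TerraTorch's task classes. This is a minimal outline under assumptions: the task arguments, decoder choice, and `my_datamodule` are illustrative placeholders, and the configs in the repository above are the reference.

```python
import lightning.pytorch as pl
from terratorch.tasks import SemanticSegmentationTask

# Illustrative task setup; argument values are placeholders, not a recipe.
task = SemanticSegmentationTask(
    model_factory="EncoderDecoderFactory",
    model_args={
        "backbone": "terramind_v1_large",
        "backbone_pretrained": True,
        "backbone_modalities": ["S2L2A"],
        "decoder": "UNetDecoder",
        "decoder_channels": [512, 256, 128, 64],
        "num_classes": 2,
    },
    loss="ce",
    lr=1e-4,
)

trainer = pl.Trainer(max_epochs=50)
trainer.fit(task, datamodule=my_datamodule)  # my_datamodule: your LightningDataModule
```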

### Backbone

Alternatively, you can build the backbone with the following code and use it in your custom pipeline.

```python
from terratorch import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD']
)
```

The model supports the following raw inputs, which you can specify in `modalities`: S2L2A, S2L1C, S1GRD, S1RTC, DEM, and RGB.
If your data does not use all bands of a modality, you can specify a subset with `bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']}`, as in the sketch below.
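
For example, a short sketch (the band subset here is an illustrative choice) of building a single-modality backbone with fewer bands:

```python
# Build an S2L2A-only backbone that expects a 6-band subset.
model_subset = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A'],
    bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']}
)
```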

You can pass the inputs to the model as a dict. If a tensor is passed directly, the model assumes it belongs to the first defined modality.
TerraMind can also handle missing input modalities.

```python
import torch

# Dummy inputs with the expected shapes (B, C, H, W)
s2l2a_tensor = torch.randn(1, 12, 224, 224)  # 12 S2L2A bands
s1grd_tensor = torch.randn(1, 2, 224, 224)   # 2 S1GRD bands

output = model(
    {
        'S2L2A': s2l2a_tensor,
        'S1GRD': s1grd_tensor,
    }
)

output.shape  # B, 196, 1024
```

The model outputs patch embeddings for each input modality. By default, the patch embeddings are averaged over all modalities to reduce the output size.
You can specify another `merge_method` from `'mean'`, `'max'`, `'concat'`, `'dict'`, and `None`:
- `mean` and `max` are applied per patch over all image modality embeddings.
- `concat` stacks all image modalities along the embedding dimension and returns one embedding per patch.
- `dict` returns all tokens split by modality in a dictionary (see the sketch below).
- `None` returns the tokens without further processing.
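
For instance, a short sketch (assuming `merge_method` is passed at build time, like the other options above) of inspecting per-modality embeddings:

```python
# Return one embedding tensor per modality instead of averaging them.
model_dict = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    merge_method='dict'
)

output = model_dict({'S2L2A': s2l2a_tensor, 'S1GRD': s1grd_tensor})
output['S2L2A'].shape  # B, 196, 1024
output['S1GRD'].shape  # B, 196, 1024
```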

### Thinking in Modalities

TerraMind introduces a new Thinking-in-Modalities (TiM) approach, in which additional modalities are predicted as an intermediate step.
The fine-tuned encoder then uses both the raw inputs and the generated modalities.

Use TiM models in TerraTorch by adding `_tim` to the model name:

```python
from terratorch import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    'terramind_v1_large_tim',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    tim_modalities=['LULC']  # optional, defaults to LULC (land-use/land-cover)
)
```

If you use TiM models, we recommend standardizing inputs with the [pre-training statistics](https://github.com/IBM/terratorch/blob/a4ca8df7c7f22ddf469f372e1099157d2d7beeb2/terratorch/models/backbones/terramind/model/terramind_register.py#L111).
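
Standardization itself is a simple per-band transform. The statistics below are placeholders; copy the actual values for your modalities and bands from the linked `terramind_register.py`:

```python
import torch

# Placeholder per-band statistics -- NOT the real pre-training values.
s2l2a_mean = torch.full((1, 12, 1, 1), 1390.0)
s2l2a_std = torch.full((1, 12, 1, 1), 2106.0)

raw_s2l2a = torch.randn(1, 12, 224, 224)  # your raw S2L2A batch
s2l2a_tensor = (raw_s2l2a - s2l2a_mean) / s2l2a_std
```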

### Generations

TerraMind can perform any-to-any generation based on varying combinations of inputs.

![terramind_generations.png](assets%2Fterramind_generations.png)

Build the full TerraMind model (including the de-tokenizer steps) from the `FULL_MODEL_REGISTRY`:

```python
from terratorch import FULL_MODEL_REGISTRY

model = FULL_MODEL_REGISTRY.build(
    'terramind_v1_large_generate',
    pretrained=True,
    modalities=['S2L2A'],
    output_modalities=['S1GRD', 'LULC'],
    timesteps=10,  # number of diffusion steps
    standardize=True,  # apply the pre-training standardization
)
```

As with the backbone, pass multiple modalities as a dict or a single modality as a tensor; the model returns the generated `output_modalities` as a dict of tensors.
Note: these generations are not reconstructions but "mental images" representing how the model imagines the modality.
You can control the level of generation detail via the number of diffusion steps (`timesteps`), which you can pass to the constructor or to the forward function.
With `standardize=True`, the pre-training standardization values are automatically applied to the inputs and outputs.
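
A short usage sketch (the input tensor and the output shapes are illustrative assumptions):

```python
import torch

s2l2a = torch.randn(1, 12, 224, 224)  # dummy S2L2A input, shape B, C, H, W

with torch.no_grad():
    generated = model(s2l2a, timesteps=50)  # override the constructor's default steps

generated['S1GRD']  # generated S1 image, e.g. shape 1, 2, 224, 224
generated['LULC']   # generated land-use/land-cover map
```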

We provide an example notebook for generations at https://github.com/IBM/terramind.

## Feedback

Your feedback is invaluable to us.
Please share it by starting a discussion in this HF repository or by submitting an issue to [TerraMind](https://github.com/IBM/terramind) on GitHub.

## Citation

If you use TerraMind in your research, please cite the [TerraMind](https://arxiv.org/abs/2504.11171) pre-print.

```text
@article{jakubik2025terramind,
  title={TerraMind: Large-Scale Generative Multimodality for Earth Observation},
  author={Jakubik, Johannes and Yang, Felix and Blumenstiel, Benedikt and Scheurer, Erik and Sedona, Rocco and Maurogiovanni, Stefano and Bosmans, Jente and Dionelis, Nikolaos and Marsocci, Valerio and Kopp, Niklas and others},
  journal={arXiv preprint arXiv:2504.11171},
  year={2025}
}
```