|
--- |
|
license: apache-2.0 |
|
library_name: terratorch |
|
datasets: |
|
- ibm-esa-geospatial/TerraMesh |
|
tags: |
|
- Earth Observation |
|
- TerraMind |
|
- IBM |
|
- ESA |
|
--- |
|
[**Paper**](https://arxiv.org/abs/2504.11171) | [**Examples**](https://github.com/IBM/terramind) | [**Model Code**](https://github.com/IBM/terratorch/tree/main/terratorch/models/backbones/terramind) | [**ESA Blog**](https://www.esa.int/Applications/Observing_the_Earth/ESA_and_IBM_collaborate_on_TerraMind) | [**IBM Blog**](https://research.ibm.com/blog/terramind-esa-earth-observation-model) | [**Challenge**](https://huggingface.co/spaces/ibm-esa-geospatial/challenge)
|
|
|
# TerraMind 1.0 large |
|
|
|
TerraMind is the first multimodal any-to-any generative foundation model for Earth Observation, jointly developed by IBM, ESA, and Forschungszentrum Jülich.
|
|
|
|
|
 |
|
|
|
## Architecture |
|
|
|
TerraMind uses a dual-scale transformer-based encoder-decoder architecture, simultaneously processing pixel-level and token-level data. |
|
The model was pre-trained on 500B tokens from 9M spatiotemporally aligned multimodal samples of the TerraMesh dataset.
|
|
|
Modality-specific patch embeddings allow direct processing of raw inputs, while modality-specific FSQ-VAEs are used for image tokenization. |
|
For sequence-like modalities such as coordinates, an adapted WordPiece tokenizer is employed. |
|
During pre-training, TerraMind leverages masked token reconstruction, learning complex cross-modal correlations to generate high-quality latent representations. |
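
For intuition, the following is a minimal sketch of the finite scalar quantization (FSQ) step at the core of an FSQ-VAE; the latent size and level counts are illustrative choices, not TerraMind's actual configuration:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(7, 7, 7, 5, 5, 5)) -> torch.Tensor:
    """Bound each latent dimension, snap it to a fixed grid of levels,
    and keep gradients flowing with a straight-through estimator.
    The rounded grid positions index a discrete codebook of image tokens."""
    n = torch.tensor(levels, dtype=z.dtype, device=z.device)  # levels per dim
    z = torch.tanh(z)                            # bound to (-1, 1)
    z_scaled = z * (n - 1) / 2                   # scale to the quantization grid
    z_q = torch.round(z_scaled)                  # nearest quantization level
    z_q = z_scaled + (z_q - z_scaled).detach()   # straight-through gradient
    return 2 * z_q / (n - 1)                     # map back to (-1, 1)

tokens = fsq_quantize(torch.randn(1, 196, 6))   # B, patches, latent dims
```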
|
|
|
## Evaluation |
|
|
|
 |
|
|
|
We benchmarked TerraMind against other geospatial foundation models using the PANGAEA benchmark. |
|
TerraMind consistently achieved state-of-the-art performance, surpassing existing models in various downstream tasks such as land use segmentation, water body mapping, and vegetation assessments. |
|
The evaluation highlights its effectiveness in handling diverse Earth Observation scenarios. |
|
We present additional experiments in our [pre-print](https://arxiv.org/abs/2504.11171). |
|
|
|
|
|
## Usage |
|
|
|
TerraMind is fully integrated into the fine-tuning package [TerraTorch](https://ibm.github.io/terratorch/). |
|
This makes it easy to initialize the pre-trained model or fine-tune it via PyTorch Lightning. |
|
The weights are automatically downloaded from Hugging Face. |
|
|
|
### Fine-tuning |
|
|
|
You can fine-tune TerraMind with a config using TerraTorch: |
|
|
|
```shell
terratorch fit -c terramind_config.yaml
```
|
|
|
For testing the fine-tuned TerraMind model, run: |
|
```shell
terratorch test -c terramind_config.yaml --ckpt_path path/to/your/checkpoint.ckpt
```
|
|
|
We provide config examples and notebooks with step-by-step explanations at https://github.com/IBM/terramind. |
|
|
|
### Backbone |
|
|
|
Alternatively, you can build the backbone with the following code and use it in your custom pipeline. |
|
|
|
```python |
|
from terratorch import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
)
|
``` |
|
|
|
The model supports the following raw inputs which you can specify in `modalities`: S2L2A, S2L1C, S1GRD, S1RTC, DEM, RGB. |
|
If your data does not use all bands of a modality, you can specify a subset with `bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']}`. |
|
You can pass the inputs to the model as a dict. If a tensor is passed directly, the model assumes it is the first defined modality.
|
TerraMind can also handle missing input modalities. |
|
|
|
```python |
|
output = model(
    {
        'S2L2A': s2l2a_tensor,  # B, 12, 224, 224
        'S1GRD': s1grd_tensor,  # B, 2, 224, 224
    }
)
|
|
|
output.shape  # B, 196, 1024
|
``` |
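
As a sketch of the band-subset and single-tensor cases described above (band names follow the example; tensor shapes are illustrative):

```python
import torch
from terratorch import BACKBONE_REGISTRY

# Sketch: build the backbone with a band subset for S2L2A.
model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A'],
    bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']},
)

# A raw tensor is interpreted as the first defined modality (here S2L2A).
s2l2a_subset = torch.randn(1, 6, 224, 224)  # B, 6 selected bands, H, W
output = model(s2l2a_subset)
```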
|
|
|
The model outputs patch embeddings for each input modality. By default, the patch embeddings are averaged over all modalities to reduce the output size. |
|
You can specify another `merge_method` from `'mean'`, `'max'`, `'concat'`, `'dict'`, and `None`, as sketched after this list.
|
- `mean` and `max` are applied per patch over all image modality embeddings. |
|
- `concat` stacks all image modalities along the embedding dimension and returns one embedding per patch. |
|
- `dict` returns all tokens split by modality in a dictionary. |
|
- `None` returns the tokens without further processing. |
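
For example, a sketch that requests per-modality embeddings instead of the averaged default, assuming `merge_method` is passed to `BACKBONE_REGISTRY.build` like the other arguments above:

```python
import torch
from terratorch import BACKBONE_REGISTRY

# Sketch: return one set of patch embeddings per modality.
model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    merge_method='dict',
)

output = model({
    'S2L2A': torch.randn(1, 12, 224, 224),
    'S1GRD': torch.randn(1, 2, 224, 224),
})
output['S2L2A'].shape  # B, 196, embed_dim
```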
|
|
|
### Thinking in Modalities |
|
|
|
TerraMind introduces a new Thinking-in-Modalities (TiM) approach, where other modalities are predicted as an intermediate step.

The fine-tuned encoder then uses both the raw inputs and the generated modalities.
|
|
|
Use TiM models in TerraTorch by adding `_tim` to the model name: |
|
```python |
|
from terratorch import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    'terramind_v1_large_tim',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    tim_modalities=['LULC'],  # optional, defaults to LULC (land-use land-cover)
)
|
``` |
|
|
|
If you use TiM models, we recommend using the [pre-training statistics](https://github.com/IBM/terratorch/blob/a4ca8df7c7f22ddf469f372e1099157d2d7beeb2/terratorch/models/backbones/terramind/model/terramind_register.py#L111) for standardization. |
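
As a minimal sketch of what this standardization looks like (the statistics below are placeholders, not the actual pre-training values from the link above):

```python
import torch

# Placeholder per-band statistics; substitute the pre-training values
# linked above for your selected modalities and bands.
mean = torch.tensor([1390.0, 1500.0])  # hypothetical per-band means
std = torch.tensor([2100.0, 2150.0])   # hypothetical per-band stds

def standardize(x: torch.Tensor) -> torch.Tensor:
    # x: B, C, H, W with one mean/std entry per channel C
    return (x - mean[None, :, None, None]) / std[None, :, None, None]

x = standardize(torch.randn(1, 2, 224, 224))
```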
|
|
|
### Generations |
|
|
|
TerraMind can perform any-to-any generation based on varying combinations of inputs. |
|
|
|
 |
|
|
|
Build the full TerraMind model (including the de-tokenization steps) from the `FULL_MODEL_REGISTRY`:
|
|
|
```python |
|
from terratorch import FULL_MODEL_REGISTRY

model = FULL_MODEL_REGISTRY.build(
    'terramind_v1_large_generate',
    pretrained=True,
    modalities=['S2L2A'],
    output_modalities=['S1GRD', 'LULC'],
    timesteps=10,      # number of diffusion steps
    standardize=True,  # apply pre-training standardization
)
|
``` |
|
As with the backbone, pass multiple modalities as a dict or a single modality as a tensor; the model returns the generated `output_modalities` as a dict of tensors.
|
Note: These generations are not reconstructions but "mental images" representing how the model imagines the modality. |
|
You can control the generation process via the number of diffusion steps (`timesteps`), which can be passed to the constructor or to the forward function.
|
By passing `standardize=True`, the pre-training standardization values are automatically applied to the input and output. |
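
Continuing from the build above, a sketch of the forward pass (the input shape is illustrative):

```python
import torch

# Generate the requested output modalities from a single S2L2A input.
s2l2a = torch.randn(1, 12, 224, 224)    # B, 12 bands, H, W
generated = model(s2l2a, timesteps=10)  # timesteps can also be set per call
list(generated.keys())  # ['S1GRD', 'LULC'], one generated tensor each
```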
|
|
|
We provide an example notebook for generations at https://github.com/IBM/terramind. |
|
|
|
## Feedback |
|
|
|
Your feedback is invaluable to us. |
|
Please share it with us by starting a discussion in this HF repository or submitting an issue to [TerraMind](https://github.com/IBM/terramind) on GitHub. |
|
|
|
## Challenge |
|
|
|
Already working with TerraMind? Submit your use case to the [TerraMind Blue-Sky Challenge](https://huggingface.co/spaces/ibm-esa-geospatial/challenge), a bi-monthly award spotlighting the boldest, most imaginative uses of TerraMind.
|
|
|
## Citation |
|
|
|
If you use TerraMind in your research, please cite the [TerraMind](https://arxiv.org/abs/2504.11171) pre-print. |
|
|
|
```text |
|
@article{jakubik2025terramind,
  title={TerraMind: Large-Scale Generative Multimodality for Earth Observation},
  author={Jakubik, Johannes and Yang, Felix and Blumenstiel, Benedikt and Scheurer, Erik and Sedona, Rocco and Maurogiovanni, Stefano and Bosmans, Jente and Dionelis, Nikolaos and Marsocci, Valerio and Kopp, Niklas and others},
  journal={arXiv preprint arXiv:2504.11171},
  year={2025}
}
|
``` |