Commit 0585329
Parent(s): 1258f5a
Update ReadMe

README.md CHANGED

@@ -20,7 +20,7 @@ Prithvi-EO-2.0 is based on the ViT architecture, pretrained using a masked autoencoder (MAE) approach

First, we replaced the 2D patch embeddings and 2D positional embeddings with 3D versions to support inputs with spatiotemporal characteristics, i.e., a sequence of T images of size (H, W). Our 3D patch embeddings consist of a 3D convolutional layer that divides the 3D input into non-overlapping cubes of size (t, h, w) for the time, height, and width dimensions, respectively. For the 3D positional encodings, we first generate 1D sin/cos encodings individually for each dimension and then combine them into a single 3D positional encoding.
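
As a rough illustration of this design, the following PyTorch sketch tokenizes a (B, C, T, H, W) input with a strided 3D convolution and builds a combined 3D sin/cos positional encoding. This is a minimal sketch, not the repository's implementation; the helper names (`sincos_1d`, `pos_embed_3d`, `PatchEmbed3D`), the equal split of the embedding dimension across the three axes, and all sizes are assumptions.

```python
# Minimal sketch of 3D patch embeddings and 3D sin/cos positional
# encodings. Names, sizes, and the equal embed_dim split are
# illustrative assumptions, not the actual Prithvi-EO-2.0 code.
import torch
import torch.nn as nn


def sincos_1d(num_pos: int, dim: int) -> torch.Tensor:
    """1D sin/cos positional encoding of shape (num_pos, dim); dim even."""
    pos = torch.arange(num_pos, dtype=torch.float32)[:, None]            # (P, 1)
    freqs = torch.pow(10000.0, -torch.arange(0, dim, 2) / dim)[None, :]  # (1, dim/2)
    angles = pos * freqs                                                 # (P, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)      # (P, dim)


def pos_embed_3d(grid: tuple, dim: int) -> torch.Tensor:
    """Combine per-axis 1D encodings into a single 3D positional encoding."""
    gt, gh, gw = grid
    d = dim // 3  # equal split across time/height/width (an assumption)
    et = sincos_1d(gt, d)[:, None, None, :].expand(gt, gh, gw, d)
    eh = sincos_1d(gh, d)[None, :, None, :].expand(gt, gh, gw, d)
    ew = sincos_1d(gw, d)[None, None, :, :].expand(gt, gh, gw, d)
    return torch.cat([et, eh, ew], dim=-1).reshape(gt * gh * gw, dim)


class PatchEmbed3D(nn.Module):
    """Tokenize a (B, C, T, H, W) input into non-overlapping (t, h, w) cubes."""

    def __init__(self, in_chans=6, embed_dim=768, patch=(1, 16, 16)):
        super().__init__()
        # kernel_size == stride: each output token sees exactly one cube.
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (B, D, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)  # (B, N, D), N = number of cubes


# Usage: a batch of T=4 images with 6 bands at 224x224 resolution.
tokens = PatchEmbed3D()(torch.randn(2, 6, 4, 224, 224))  # (2, 4*14*14, 768)
tokens = tokens + pos_embed_3d((4, 14, 14), 768)
```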

- Second, we considered geolocation (center latitude and longitude) and date of acquisition (year and day-of-year, ranging 1-365) in pretraining. Both the encoder and the decoder receive time and location information for each sample and encode them independently using 2D sin/cos encodings. These encodings are added to the embedded tokens via a weighted sum with learned weights: one for time and one for location, with separate weights for the encoder and the decoder. Since this metadata is often not available, we added a drop mechanism during pretraining that randomly drops the geolocation and/or the temporal data to help the model learn how to handle the absence of this information.
+ Second, we considered geolocation (center latitude and longitude) and date of acquisition (year and day-of-year, ranging 1-365) in the pretraining of the TL model versions. Both the encoder and the decoder receive time and location information for each sample and encode them independently using 2D sin/cos encodings. These encodings are added to the embedded tokens via a weighted sum with learned weights: one for time and one for location, with separate weights for the encoder and the decoder. Since this metadata is often not available, we added a drop mechanism during pretraining that randomly drops the geolocation and/or the temporal data to help the model learn how to handle the absence of this information.
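
A minimal sketch of this metadata pathway might look as follows, assuming PyTorch; `MetadataEncoding`, the quarter-dimension split per scalar, the drop probability, and the use of raw latitude/longitude and year/day values are illustrative assumptions rather than the model's actual code. Instantiating one module for the encoder and another for the decoder yields the separate learned weights mentioned above.

```python
# Hedged sketch of the metadata pathway: 2D sin/cos encodings of
# (lat, lon) and (year, day-of-year), added via learned scalar weights,
# with random dropping during pretraining. All names are assumptions.
import torch
import torch.nn as nn


class MetadataEncoding(nn.Module):
    """Adds 2D sin/cos encodings of location and time to the tokens."""

    def __init__(self, embed_dim: int = 768, drop_prob: float = 0.1):
        super().__init__()
        self.embed_dim = embed_dim
        self.drop_prob = drop_prob  # drop rate is an assumed value
        # One learned scalar weight per metadata type. The encoder and
        # decoder would each own their own instance, hence separate weights.
        self.w_time = nn.Parameter(torch.zeros(1))
        self.w_loc = nn.Parameter(torch.zeros(1))

    def _sincos_2d(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Encode a pair of per-sample scalars into a (B, D) vector."""
        d = self.embed_dim // 4  # sin+cos halves for each of the two scalars
        freqs = torch.pow(10000.0, -torch.arange(d, device=a.device) / d)
        ang = torch.stack([a[:, None] * freqs, b[:, None] * freqs], dim=1)
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten(1)

    def _dropped(self) -> bool:
        # Random metadata drop is active only during (pre)training.
        return self.training and torch.rand(()).item() < self.drop_prob

    def forward(self, tokens, latlon=None, yearday=None):
        # tokens: (B, N, D); latlon, yearday: (B, 2) or None when missing.
        if latlon is not None and not self._dropped():
            loc = self._sincos_2d(latlon[:, 0], latlon[:, 1])
            tokens = tokens + self.w_loc * loc[:, None, :]
        if yearday is not None and not self._dropped():
            tim = self._sincos_2d(yearday[:, 0], yearday[:, 1])
            tokens = tokens + self.w_time * tim[:, None, :]
        return tokens


# Usage: patch-embedded tokens plus per-sample metadata.
meta = MetadataEncoding()
out = meta(torch.randn(2, 784, 768),
           latlon=torch.tensor([[45.0, 8.9], [-12.5, 130.8]]),
           yearday=torch.tensor([[2020.0, 180.0], [2021.0, 45.0]]))
```
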
## Pre-trained Models