---
license: cc-by-nc-4.0
---
# core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
[Model on Hugging Face](https://huggingface.co/gajeshladhar/core-dino)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)


---
## Overview
`core-dino` is a resolution-agnostic **self-supervised model** for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) with a DINO-inspired setup. It handles imagery from **20 cm to 2 m** resolution, learning strong spatial features without any labels.
<p>
<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
<b>Open Demo ▶️</b>
</a> - Run multi-resolution inference & visualize spatial embeddings.</p>
---
## Quickstart
```python
import torch
from ultralytics import YOLO

# Any YOLOv11 head works here (obb, bbox, or seg); only the shared backbone weights matter
model = YOLO("yolo11x-obb.pt")

# Download the core-dino student checkpoint and remap its keys into the YOLO namespace
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False,
)
```
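The Colab demo covers embedding visualization in full; as a minimal local alternative, the sketch below pulls a spatial feature map out of the model loaded above with a standard PyTorch forward hook. The layer index (22, the last module before the detection head) and the dummy input are illustrative assumptions, not part of a documented API.

```python
import torch

# Hypothetical feature-extraction sketch: hook the last module before the detection
# head (index 22 is an assumption) and read out its spatial embedding.
features = {}
hook = model.model.model[22].register_forward_hook(
    lambda module, inputs, output: features.update(embedding=output.detach())
)

model.model.eval()
with torch.no_grad():
    model.model(torch.rand(1, 3, 640, 640))   # replace with a real image tensor in [0, 1]
hook.remove()

emb = features["embedding"]                    # spatial embedding, roughly (1, C, H/32, W/32)
print(emb.shape)
```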
---
## 🧠 Architecture: DINO × YOLO × I-JEPA
We combine three ideas to build a high-performance backbone for spatial representation learning:
#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global crops)**
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.
> In `core-dino`, we replace this with **clean vs degraded resolution contrast**:
- 👨‍🏫 **Teacher** gets clean 30 cm satellite imagery.
- 👨‍🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm to 2 m) with photometric and spatial distortions.
This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.
#### 2️⃣ **I-JEPA-Style Patch Dropping**
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
- Random **patch regions are dropped** from the student input.
- The objective is to align the **visible patch embeddings** with the teacher's corresponding high-resolution ones.
- This enforces **local-global and partial-whole consistency** in the latent space (a masking sketch follows below).
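As a concrete illustration of the patch-dropping idea, here is a minimal, self-contained sketch (not the training code): it zeroes out random patch-sized regions of the student input, with the patch size and drop fraction chosen arbitrarily.

```python
import torch

def drop_patches(img: torch.Tensor, patch: int = 32, drop_frac: float = 0.3) -> torch.Tensor:
    """I-JEPA-style masking sketch: zero out random patch regions of the student view.
    `patch` and `drop_frac` are assumed values, not the published training settings."""
    b, _, h, w = img.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=img.device) > drop_frac).float()
    # Expand the per-patch keep/drop grid back to pixel resolution
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return img * mask

student_input = drop_patches(torch.rand(2, 3, 640, 640))
```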
#### 3️⃣ **YOLOv11-X as Encoder Backbone**
- We use **YOLOv11-X**, the largest of the current YOLOv11 variants, as the spatial encoder.
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency (see the encoder sketch below).
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning.
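The snippet below is a rough sketch of what "truncated after 23 layers" can look like with the ultralytics model graph: it runs only the first 23 modules and returns the final feature map, reproducing the library's internal layer routing (each module stores its input indices in `.f`). The layer count comes from this card; everything else is illustrative.

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt").model.eval()   # underlying nn.Module graph

def encode(x: torch.Tensor, n_layers: int = 23) -> torch.Tensor:
    """Run only the first `n_layers` modules of the YOLOv11-X graph as an encoder."""
    y = []                                   # cached outputs for skip/concat connections
    for m in yolo.model[:n_layers]:
        if m.f != -1:                        # this module reads from earlier layer(s)
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        y.append(x)
    return x                                 # deepest retained feature map

with torch.no_grad():
    feats = encode(torch.rand(1, 3, 640, 640))
```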
---
## 🧪 Training Flow: Resolution-Agnostic DINO
The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
#### 👨‍🏫 1. Teacher View (Clean & High-Res)
- Receives a **clean 30 cm image** without any augmentation.
- Used as the stable reference to guide the student.
#### 👨‍🎓 2. Student View (Augmented Multi-Resolution)
- Receives **randomly augmented** versions of the same image (a rough sketch follows this list):
  - Downsampled to resolutions between **30 cm and 2 m**
  - Augmented with noise, blur, color jitter, spatial dropout, etc.
- Emulates the resolution variability common in EO imagery.
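A minimal sketch of such a degradation pipeline is shown below; the GSD range matches the card, while the specific photometric distortions and their magnitudes are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def student_view(img: torch.Tensor, native_gsd: float = 0.3) -> torch.Tensor:
    """Degrade a clean 30 cm chip to a random GSD in [0.3 m, 2.0 m] and jitter it.
    Illustrative only: the real augmentation recipe is not reproduced here."""
    target_gsd = random.uniform(0.3, 2.0)                     # metres per pixel
    scale = native_gsd / target_gsd                           # e.g. 0.3 / 2.0 = 0.15
    low = F.interpolate(img, scale_factor=scale, mode="bilinear", align_corners=False)
    low = low * random.uniform(0.8, 1.2)                      # brightness jitter
    low = low + 0.02 * torch.randn_like(low)                  # additive sensor-like noise
    return low.clamp(0.0, 1.0)

views = [student_view(torch.rand(1, 3, 1024, 1024)) for _ in range(4)]
```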
#### ⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce different spatial dimensions (H × W), we use **bilinear interpolation** to **resize the student's feature map** to match the teacher's spatial shape before computing the contrastive loss (see the snippet below).
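In code, that resizing step is a single interpolation call; the snippet below uses dummy tensors in place of the real feature maps.

```python
import torch
import torch.nn.functional as F

# Dummy (B, C, H, W) feature maps: the student grid is smaller because its input was downsampled
teacher_feat = torch.rand(2, 256, 40, 40)
student_feat = torch.rand(2, 256, 12, 12)

# Resize the student features onto the teacher's spatial grid before computing the loss
student_feat = F.interpolate(
    student_feat, size=teacher_feat.shape[-2:], mode="bilinear", align_corners=False
)
```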
#### 🎯 4. Objective
- Align the student's spatial token embeddings with the teacher's, pixel-to-pixel and semantically, despite resolution gaps and augmentations (a loss sketch follows below).
- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
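A hedged sketch of what such a spatial-token alignment objective can look like is given below, using a DINO-style softened cross-entropy; the temperatures and the omission of teacher centering are assumptions rather than the released recipe.

```python
import torch
import torch.nn.functional as F

def dino_alignment_loss(student_feat, teacher_feat, t_student=0.1, t_teacher=0.04):
    """DINO-style loss over spatial tokens: the student matches the teacher's soft targets
    at every spatial location. Temperature values here are assumptions."""
    # (B, C, H, W) -> (B*H*W, C): one token per cell of the feature grid
    s = student_feat.permute(0, 2, 3, 1).reshape(-1, student_feat.shape[1])
    t = teacher_feat.permute(0, 2, 3, 1).reshape(-1, teacher_feat.shape[1])
    targets = F.softmax(t.detach() / t_teacher, dim=-1)      # teacher provides soft targets
    log_probs = F.log_softmax(s / t_student, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

loss = dino_alignment_loss(torch.rand(2, 256, 40, 40), torch.rand(2, 256, 40, 40))
```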
---
## Performance: Latent Quality & Downstream Evaluation
Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization, both in visual similarity and in downstream tasks.
### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset)
We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.
- **Setup:**
  - Both the `core-dino` and **YOLOv11-X** backbones were **frozen**
  - Only a **2-layer convolutional head** was trained (a minimal sketch follows this list)
  - Task: binary road segmentation with an IoU loss
- **Result:**
  - `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs
  - Shows superior latent representation quality, even without task-specific supervision
  - Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
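For reference, a minimal version of such a probe might look like the sketch below: a dummy tensor stands in for the frozen backbone features, the two-layer head and channel widths are assumptions, and the loss is a standard soft-IoU formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoadHead(nn.Module):
    """Two-layer convolutional probe on top of frozen backbone features (widths assumed)."""
    def __init__(self, in_ch: int = 256, hidden: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor, out_hw) -> torch.Tensor:
        logits = self.head(feats)
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)

def soft_iou_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft IoU loss for binary road masks."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

feats = torch.rand(2, 256, 64, 64)                     # stands in for frozen core-dino features
masks = (torch.rand(2, 1, 512, 512) > 0.5).float()     # stands in for DeepGlobe road masks
loss = soft_iou_loss(RoadHead()(feats, masks.shape[-2:]), masks)
```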
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
<span style="font-size: 16px;"><strong>Reproduce this comparison in Colab:</strong></span>
<a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>
</p>
<p align="center"><br>
<img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
</p>
### Downstream: Building Footprint Validation
To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.
- **Setup:**
  - Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights** (see the sketch after this list)
  - The same training pipeline was used for both
- **Result:**
  - The `core-dino` initialization achieved a **+15 mAP** improvement over standard YOLOv11-X
  - Captures edge-localized and compact building structures better
  - Demonstrates strong spatial precision and high-quality feature encoding
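The comparison can be reproduced along the lines of the sketch below: two identical ultralytics training runs that differ only in their initial weights. `buildings.yaml` is a placeholder dataset config, and the training hyperparameters are illustrative.

```python
import torch
from ultralytics import YOLO

url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")

baseline = YOLO("yolo11x-obb.pt")                      # original YOLOv11-X initialization
pretrained = YOLO("yolo11x-obb.pt")                    # same architecture, core-dino weights
pretrained.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()}, strict=False
)

# Identical training pipeline for both runs; only the starting weights differ
for run_name, m in [("yolo11x-baseline", baseline), ("yolo11x-core-dino", pretrained)]:
    m.train(data="buildings.yaml", epochs=100, imgsz=640, name=run_name)
```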
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
<span style="font-size: 16px;"><strong>Reproduce this comparison in Colab:</strong></span>
<a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>
</p>
<p align="center"><br>
<img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
</p>
---
## Model Details
| Field | Value |
|--------------------|--------------------------------------------------------------|
| Parameters | **56.7M** |
| Backbone Architecture | **YOLOv11-X** |
| Input Size | **320 × 320 to 4096 × 4096** |
| Patch Source | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
| Resolutions | 30 cm (clean) to 2 m (augmented) |
| Patch Drop | I-JEPA-style masking |
| Loss | DINO contrastive loss |
| Training Time | ~48 h on 1× A100 |
---
## License
This project is released under the **[Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/)** license.
> ✅ Free to use, share, and adapt for **non-commercial research**
> ❌ **Commercial use is not permitted** without explicit permission
> Please provide appropriate credit when using this model in publications or projects.