---
license: cc-by-nc-4.0
---

# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery

[![πŸ€— Model Hub](https://img.shields.io/badge/HuggingFace-core--dino-blue?logo=huggingface&logoColor=white)](https://huggingface.co/gajeshladhar/core-dino)
[![πŸš€ Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
![πŸ›°οΈ Domain](https://img.shields.io/badge/Domain-Earth%20Observation-green)
![πŸ” Task](https://img.shields.io/badge/Task-Self--Supervised--Learning-orange)

---

## πŸ”­ Overview

`core-dino` is a resolution-agnostic **self-supervised model** designed for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) using a DiNO-inspired setup. It handles imagery between **20β€―cm and 2β€―m**, learning strong spatial features without any labels.

<p>
<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
  <b>Open Demo ▢️</b>
</a> - Run multi-resolution inference & visualize spatial embeddings.</p>

---

## πŸŒ€ Quickstart

```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # any YOLOv11 task head works (obb, detect/bbox, or seg)
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
# checkpoint keys are prefixed with 'layers.'; remap them to the YOLO backbone's 'model.' prefix
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)
```
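
Because the checkpoint covers only the backbone, `strict=False` leaves the OBB head with its original `yolo11x-obb` weights. A quick hedged sanity check might look like this (the image path is a placeholder):

```python
# Usage sketch: the backbone now carries core-dino features, but the detection head
# is unchanged, so fine-tune on your task before relying on its predictions.
results = model.predict("scene.jpg", imgsz=1024)  # "scene.jpg" is a placeholder path
print(results[0].obb)                             # oriented-box outputs from the yolo11x-obb head
```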

---

## 🧠 Architecture: DiNO Γ— YOLO Γ— I-JEPA

We combine three ideas to build a high-performance backbone for spatial representation learning:

#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)**
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.  
> In `core-dino`, we replace this with **clean vs degraded resolution contrast**:
- πŸ‘¨β€πŸ« **Teacher** gets clean 30β€―cm satellite imagery.
- πŸ‘¨β€πŸŽ“ **Student** sees augmented versions of the same scene at varying resolutions (30β€―cm β†’ 2β€―m) with photometric and spatial distortions.

This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.

#### 2️⃣ **I-JEPA-Style Patch Dropping**
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
- Random **patch regions are dropped** from the student input.
- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones.
- This enforces **local-global and partial-whole consistency** in the latent space (a minimal masking sketch follows below).
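
A minimal sketch of such patch dropping on the student input; the patch size and drop ratio here are illustrative, not the values used to train `core-dino`:

```python
import torch

def drop_patches(x, patch=32, drop_ratio=0.3):
    """Zero out a random subset of non-overlapping patches in a (B, C, H, W) batch."""
    b, c, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=x.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, mask  # mask marks the patches that stay visible to the student

student_in, visible = drop_patches(torch.randn(2, 3, 320, 320))
```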

#### 3️⃣ **YOLOv11-X as Encoder Backbone**
- We use **YOLOv11-X**, the largest of the recent YOLOv11 variants, as the spatial encoder.
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while keeping the encoder efficient.
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning (see the inspection sketch below).
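
A small inspection sketch, assuming the standard `ultralytics` layout where the underlying network is an indexable `nn.Sequential`; the 23-module slice is shown only to illustrate the truncation point (YOLO blocks also consume skip connections, so the slice is not a drop-in standalone feature extractor):

```python
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt")
layers = yolo.model.model                 # underlying nn.Sequential of YOLOv11-X modules
print(len(layers))                        # total module count, detection head included
encoder = layers[:23]                     # the first 23 modules retained by core-dino
print(sum(p.numel() for p in encoder.parameters()))  # parameters in the truncated encoder
```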


---

## πŸ§ͺ Training Flow: Resolution-Agnostic DiNO

The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:

#### πŸ‘¨β€πŸ«  1. Teacher View (Clean & High-Res)
- Receives a **clean 30β€―cm image** without any augmentation.
- Used as the stable reference to guide the student.

#### πŸ‘¨β€πŸŽ“ 2. Student View (Augmented Multi-Resolution)
- Receives **randomly augmented** versions of the same image:
  - Downsampled to **30β€―cm to 2β€―m**
  - Augmented with noise, blur, color jitter, spatial dropout, etc.
- Emulates the resolution variability common in EO imagery (see the augmentation sketch below).
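
A minimal sketch of how such a degraded student view can be produced from the clean teacher tile; the 30 cm to 2 m range comes from the text above, while the specific augmentations and their strengths are illustrative:

```python
import random
import torch
import torch.nn.functional as F

def make_student_view(clean, gsd_clean=0.3, gsd_max=2.0):
    """Degrade a clean (B, C, H, W) tile to a random coarser GSD, then lightly augment it."""
    gsd = random.uniform(gsd_clean, gsd_max)      # target ground-sample distance in metres
    scale = gsd_clean / gsd                       # e.g. 0.3 m -> 2.0 m gives a 0.15x rescale
    low = F.interpolate(clean, scale_factor=scale, mode="bilinear", align_corners=False)
    low = low + 0.02 * torch.randn_like(low)      # sensor-like noise (illustrative strength)
    low = low * random.uniform(0.9, 1.1)          # simple photometric jitter
    return low, gsd

view, gsd = make_student_view(torch.rand(1, 3, 512, 512))
```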

#### ⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce feature maps of different spatial dimensions (H × W),  
  we use **bilinear interpolation** to **resize the student's feature map** to match the teacher's spatial shape before computing the contrastive loss (see the snippet below).
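
In code, this alignment step boils down to a single interpolation call (a sketch with made-up tensor shapes):

```python
import torch
import torch.nn.functional as F

teacher_feats = torch.randn(1, 256, 40, 40)   # illustrative (B, C, H, W) map from the clean view
student_feats = torch.randn(1, 256, 12, 12)   # coarser map from a low-resolution student view

student_aligned = F.interpolate(
    student_feats, size=teacher_feats.shape[-2:], mode="bilinear", align_corners=False
)  # now (1, 256, 40, 40), spatially comparable with the teacher map
```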

#### 🎯 4. Objective
- Align the student's spatial token embeddings with the teacher's, pixel to pixel and semantically, despite resolution gaps and augmentations (a minimal loss sketch follows below).
- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
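
A hedged per-pixel version of such an alignment loss, assuming the feature maps were already resized as in step 3 and using DINO-typical temperature values, might look like:

```python
import torch.nn.functional as F

def spatial_dino_loss(student_aligned, teacher_feats, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student channel distributions at every pixel."""
    s = F.log_softmax(student_aligned / tau_s, dim=1)     # student distribution over channels
    t = F.softmax(teacher_feats.detach() / tau_t, dim=1)  # teacher acts as a stop-gradient target
    return -(t * s).sum(dim=1).mean()                     # averaged over batch and pixels
```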



---

## πŸ“ˆ Performance: Latent Quality & Downstream Evaluation

Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability β€” both in visual similarity and downstream tasks.

### πŸ›£οΈ Downstream: Road Extraction (DeepGlobe Dataset)

We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.

- **Setup:**  
  - Both the `core-dino` and **YOLOv11-X** backbones were **frozen**  
  - Only a **2-layer convolutional head** was trained  
  - Task: binary road segmentation trained with an IoU loss (a minimal head sketch follows the results below)

- **Result:**  
  - `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs  
  - Shows superior latent representation quality, even without task-specific supervision  
  - Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
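
A minimal sketch of the evaluation head described above; the input channel count, hidden width, and exact IoU formulation are assumptions rather than the exact training code:

```python
import torch
import torch.nn as nn

class RoadHead(nn.Module):
    """2-layer convolutional head trained on top of a frozen backbone's feature map."""
    def __init__(self, in_channels=256, hidden=128):   # channel counts are illustrative
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),                    # single-channel road logits
        )

    def forward(self, feats):
        return self.head(feats)

def iou_loss(logits, target, eps=1e-6):
    """Soft-IoU (Jaccard) loss for binary road masks."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()
```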

<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
  <span style="font-size: 16px;">πŸ““ <strong>Reproduce this comparison in Colab:</strong></span>
  <a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
  </a>
</p>
<p align="center"><br>
  <img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
</p>


### πŸ™οΈ Downstream : Building Footprint Validation

To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.

- **Setup:**  
  - Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights**  
  - Used the same training pipeline for both (a sketch follows the results below)

- **Result:**  
  - `core-dino` achieved **+15 mAP** improvement over standard YOLOv11-X  
  - Captures edge-localized and compact building structures better  
  - Demonstrates strong spatial precision and high-quality feature encoding
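
A hedged sketch of that comparison, reusing the Quickstart weight loading; the dataset YAML and hyper-parameters below are placeholders, not the actual training configuration:

```python
from ultralytics import YOLO

def train_footprints(init_with_core_dino: bool):
    model = YOLO("yolo11x-obb.pt")
    if init_with_core_dino:
        # ...load the core-dino student checkpoint into model.model as shown in the Quickstart...
        pass
    # identical recipe for both runs; "buildings.yaml" is a placeholder dataset config
    return model.train(data="buildings.yaml", imgsz=1024, epochs=100)

baseline = train_footprints(init_with_core_dino=False)
with_core_dino = train_footprints(init_with_core_dino=True)
```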

<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
  <span style="font-size: 16px;">πŸ““ <strong>Reproduce this comparison in Colab:</strong></span>
  <a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
  </a>
</p>
<p align="center"><br>
  <img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
</p>


---

## πŸ—‚οΈ Model Details

| Field              | Value                                                        |
|--------------------|--------------------------------------------------------------|
| Parameters         | **56.7M**                                                    |
| Backbone Architecture | **YOLOv11-X**                                                |
| Input Size         | **320 Γ— 320 – 4096 Γ— 4096**                                  |
| Patch Source       | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
| Resolutions        | 30β€―cm (clean) β†’ 2β€―m (augmented)                              |
| Patch Drop         | I-JEPA-style masking                                         |
| Loss               | DINO contrastive loss                                        |
| Training Time      | ~48h on 1Γ—A100                                               |


---
## πŸ’³ License  

This project is released under the **[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)** license.

> βœ… Free to use, share, and adapt for **non-commercial research**  
> ❌ **Commercial use is not permitted** without explicit permission  
> 📌 Please provide appropriate credit when using this model in publications or projects.