|
--- |
|
license: cc-by-nc-4.0 |
|
--- |
|
|
|
# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery |
|
|
|
[🤗 Model on Hugging Face](https://huggingface.co/gajeshladhar/core-dino)
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
|
|
|
|
--- |
|
|
|
## 🔭 Overview |
|
|
|
`core-dino` is a resolution-agnostic **self-supervised model** for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) with a DINO-inspired student-teacher setup. It handles imagery from **20 cm to 2 m** ground resolution, learning strong spatial features without any labels.
|
|
|
<p> |
|
<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank"> |
|
<b>Open Demo ▶️</b> |
|
</a> - Run multi-resolution inference & visualize spatial embeddings.</p> |
|
|
|
--- |
|
|
|
## 🌀 Quickstart |
|
|
|
```python
import torch
from ultralytics import YOLO

# any YOLO11 task variant works here (OBB, detection, or segmentation)
model = YOLO("yolo11x-obb.pt")

# download the self-supervised core-dino student checkpoint
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")

# remap checkpoint keys ('layers.*' -> 'model.*') and load into the backbone;
# strict=False leaves the task-specific head at its original initialization
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)
```
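
Because the checkpoint covers only the backbone, loading with `strict=False` swaps in the pretrained encoder weights while the detection head keeps its original initialization, ready for downstream fine-tuning.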
|
|
|
--- |
|
|
|
## 🧠 Architecture: DiNO × YOLO × I-JEPA |
|
|
|
We combine three ideas to build a high-performance backbone for spatial representation learning: |
|
|
|
#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)** |
|
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views. |
|
> In `core-dino`, we replace this with **clean vs degraded resolution contrast**: |
|
- 👨🏫 **Teacher** gets clean 30 cm satellite imagery. |
|
- 👨🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions. |
|
|
|
This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps. |
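
As an illustration, here is a minimal sketch of how such a clean-vs-degraded pair could be produced; the helper name `make_student_view` and the specific augmentation choices are assumptions for illustration, not code from this repository:

```python
import torch
import torch.nn.functional as F

def make_student_view(img: torch.Tensor, base_gsd: float = 0.3,
                      max_gsd: float = 2.0) -> torch.Tensor:
    """Degrade a clean 30 cm patch to a random coarser resolution (illustrative)."""
    gsd = torch.empty(1).uniform_(base_gsd, max_gsd).item()  # target GSD in metres
    scale = base_gsd / gsd                                   # e.g. 0.3 / 2.0 = 0.15
    low = F.interpolate(img, scale_factor=scale,
                        mode="bilinear", align_corners=False)
    # simple photometric distortion: brightness jitter + additive Gaussian noise
    low = low * torch.empty(1).uniform_(0.8, 1.2).item() + 0.02 * torch.randn_like(low)
    return low.clamp(0.0, 1.0)

teacher_view = torch.rand(1, 3, 512, 512)       # clean 30 cm image for the teacher
student_view = make_student_view(teacher_view)  # degraded copy for the student
```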
|
|
|
#### 2️⃣ **I-JEPA-Style Patch Dropping** |
|
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243): |
|
- Random **patch regions are dropped** from the student input. |
|
- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones. |
|
- This enforces **local-global and partial-whole consistency** in the latent space. |
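
A minimal sketch of this masking step, assuming zeroed patch-aligned regions (the masking granularity and drop ratio are illustrative, not taken from the repo):

```python
import torch

def drop_patches(x: torch.Tensor, patch: int = 32,
                 drop_ratio: float = 0.3) -> torch.Tensor:
    """Randomly zero patch-aligned regions of the student input (illustrative).

    Assumes H and W are multiples of `patch`.
    """
    b, _, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=x.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask  # teacher still sees the full, unmasked image
```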
|
|
|
#### 3️⃣ **YOLOv11-X as Encoder Backbone** |
|
- We use **YOLOv11-X**, the largest model in the recent YOLOv11 family, as the spatial encoder.
|
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency. |
|
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning. |
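
A sketch of how such a truncation can be done with ultralytics; the routing loop mirrors how YOLO models forward through skip connections (`m.f` holds each layer's input indices in current ultralytics releases), but the exact cut point and usage here are assumptions:

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt")
encoder = yolo.model.model[:23]  # keep only the first 23 modules as the encoder

def encode(x: torch.Tensor) -> torch.Tensor:
    """Run the truncated backbone, honouring YOLO's internal skip routing."""
    cache = []
    for m in encoder:
        if m.f != -1:  # layer consumes output(s) of earlier layers (e.g. Concat)
            x = cache[m.f] if isinstance(m.f, int) else \
                [x if j == -1 else cache[j] for j in m.f]
        x = m(x)
        cache.append(x)
    return x  # spatial feature map used for self-supervised alignment
```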
|
|
|
|
|
--- |
|
|
|
## 🧪 Training Flow: Resolution-Agnostic DiNO |
|
|
|
The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery: |
|
|
|
#### 👨🏫 1. Teacher View (Clean & High-Res) |
|
- Receives a **clean 30 cm image** without any augmentation. |
|
- Used as the stable reference to guide the student. |
|
|
|
#### 👨🎓 2. Student View (Augmented Multi-Resolution) |
|
- Receives **randomly augmented** versions of the same image: |
|
  - Downsampled to resolutions between **30 cm and 2 m**
|
- Augmented with noise, blur, color jitter, spatial dropout, etc. |
|
- Emulates resolution variability common in EO imagery. |
|
|
|
#### ⚠️ 3. Spatial Misalignment & Solution |
|
- Since different student resolutions produce different spatial dimensions (H × W), |
|
we use **bilinear interpolation** to **resize the student’s feature map** to match the teacher's spatial shape before computing the contrastive loss. |
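
In code, this resizing step is a single interpolation call (tensor shapes below are placeholders, not the model's actual feature dimensions):

```python
import torch
import torch.nn.functional as F

student_feat = torch.rand(1, 256, 40, 40)    # placeholder low-res student features
teacher_feat = torch.rand(1, 256, 160, 160)  # placeholder teacher features

# upsample the student map to the teacher's spatial grid before the loss
student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
assert student_feat.shape == teacher_feat.shape
```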
|
|
|
#### 🎯 4. Objective |
|
- Align the spatial token embeddings of the student with the teacher — pixel-to-pixel and semantically — despite resolution gaps and augmentations. |
|
- Encourages **scale-invariant**, **robust** feature learning across real-world variations. |
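
The exact loss implementation is not spelled out here, but a dense, per-pixel adaptation of the DINO cross-entropy objective might look like the following sketch; the temperatures and teacher centering follow the DINO paper, while applying them at every spatial location is our assumption:

```python
import torch
import torch.nn.functional as F

def dense_dino_loss(student: torch.Tensor, teacher: torch.Tensor,
                    center: torch.Tensor, tau_s: float = 0.1,
                    tau_t: float = 0.04) -> torch.Tensor:
    """DINO-style cross-entropy between student and teacher at every pixel."""
    t = F.softmax((teacher.detach() - center) / tau_t, dim=1)  # sharpened, no grad
    log_s = F.log_softmax(student / tau_s, dim=1)
    return -(t * log_s).sum(dim=1).mean()

feat_s = torch.rand(1, 256, 160, 160, requires_grad=True)  # resized student map
feat_t = torch.rand(1, 256, 160, 160)                      # teacher map
center = feat_t.mean(dim=(0, 2, 3), keepdim=True)  # a running mean in practice
loss = dense_dino_loss(feat_s, feat_t, center)
```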
|
|
|
|
|
|
|
--- |
|
|
|
## 📈 Performance: Latent Quality & Downstream Evaluation |
|
|
|
Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability — both in visual similarity and downstream tasks. |
|
|
|
### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset) |
|
|
|
We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline. |
|
|
|
- **Setup:** |
|
- Both `core-dino` and **YOLOv11-X** backbones were **frozen** |
|
- Only a **2-layer convolutional head** was trained |
|
- Task: Binary road segmentation using IoU loss |
|
|
|
- **Result:** |
|
- `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs |
|
- Shows superior latent representation quality, even without task-specific supervision |
|
- Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks |
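
A minimal sketch of this frozen-backbone probe; the stand-in encoder, the 256-channel width, and the soft-IoU formulation are illustrative assumptions, not the exact evaluation code:

```python
import torch
import torch.nn as nn

# freeze the pretrained encoder; only the small head is trained
encoder = nn.Conv2d(3, 256, 3, padding=1)  # stand-in for the frozen core-dino backbone
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Sequential(                       # the 2-layer convolutional head
    nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1),
)

def iou_loss(logits: torch.Tensor, target: torch.Tensor,
             eps: float = 1e-6) -> torch.Tensor:
    """Soft IoU loss for binary road masks."""
    prob = logits.sigmoid()
    inter = (prob * target).sum(dim=(2, 3))
    union = (prob + target - prob * target).sum(dim=(2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()

x = torch.rand(2, 3, 256, 256)
y = torch.randint(0, 2, (2, 1, 256, 256)).float()
loss = iou_loss(head(encoder(x)), y)  # gradients flow into the head only
```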
|
|
|
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;"> |
|
<span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span> |
|
<a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> |
|
</a> |
|
</p> |
|
<p align="center"><br> |
|
<img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;"> |
|
</p> |
|
|
|
|
|
### 🏙️ Downstream: Building Footprint Validation
|
|
|
To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery. |
|
|
|
- **Setup:** |
|
- Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights** |
|
  - Used the same training pipeline for both
|
|
|
- **Result:** |
|
- `core-dino` achieved **+15 mAP** improvement over standard YOLOv11-X |
|
- Captures edge-localized and compact building structures better |
|
- Demonstrates strong spatial precision and high-quality feature encoding |
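
The comparison boils down to initializing the same detector two ways before an identical training run; a sketch using the ultralytics API (`buildings.yaml` is a placeholder dataset config, not shipped with this repo):

```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # baseline: original YOLOv11-X weights

# variant: re-initialize the backbone with core-dino weights (as in the Quickstart)
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()}, strict=False)

# identical training pipeline for both variants
model.train(data="buildings.yaml", epochs=50, imgsz=1024)
```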
|
|
|
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;"> |
|
<span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span> |
|
<a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> |
|
</a> |
|
</p> |
|
<p align="center"><br> |
|
<img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;"> |
|
</p> |
|
|
|
|
|
--- |
|
|
|
## 🗂️ Model Details |
|
|
|
| Field | Value | |
|
|--------------------|--------------------------------------------------------------| |
|
| Parameters | **56.7M** | |
|
| Backbone Architecture | **YOLOv11-X** |
|
| Input Size | **320 × 320 – 4096 × 4096** | |
|
| Patch Source | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) | |
|
| Resolutions | 30 cm (clean) → 2 m (augmented) | |
|
| Patch Drop | I-JEPA-style masking | |
|
| Loss | DINO contrastive loss | |
|
| Training Time | ~48h on 1×A100 | |
|
|
|
|
|
--- |
|
## 💳 License |
|
|
|
This project is released under the **[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)** license.
|
|
|
> ✅ Free to use, share, and adapt for **non-commercial research** |
|
> ❌ **Commercial use is not permitted** without explicit permission |
|
> 📌 Please provide appropriate credit when using this model in publications or projects.
|
|
|
|