---
license: cc-by-nc-4.0
---
# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
[![🤗 Model Hub](https://img.shields.io/badge/HuggingFace-core--dino-blue?logo=huggingface&logoColor=white)](https://huggingface.co/gajeshladhar/core-dino)
[![🚀 Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
![🛰️ Domain](https://img.shields.io/badge/Domain-Earth%20Observation-green)
![🔍 Task](https://img.shields.io/badge/Task-Self--Supervised--Learning-orange)
---
## 🔭 Overview
`core-dino` is a resolution-agnostic **self-supervised model** designed for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) using a DiNO-inspired setup. It handles imagery at resolutions from **20 cm to 2 m**, learning strong spatial features without any labels.
<p>
<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
<b>Open Demo ▶️</b>
</a> - Run multi-resolution inference & visualize spatial embeddings.</p>
---
## 🌀 Quickstart
```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # any YOLOv11 variant works: OBB, bbox, or seg

url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")

# remap core-dino keys ('layers.*') to the Ultralytics naming ('model.*') and load the backbone
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False,
)
```
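Loading with `strict=False` swaps the core-dino weights into the backbone while leaving the pretrained task head untouched, so for a new task the head still needs fine-tuning (the downstream evaluations below follow this pattern).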
---
## 🧠 Architecture: DiNO × YOLO × I-JEPA
We combine three ideas to build a high-performance backbone for spatial representation learning:
#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)**
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.
> In `core-dino`, we replace this with **clean vs degraded resolution contrast**:
- 👨‍🏫 **Teacher** gets clean 30 cm satellite imagery.
- 👨‍🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions.
This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.
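As a rough illustration, a clean/degraded view pair could be generated as below. The GSD range follows the description above, but the `make_views` helper, augmentation choices, and parameter values are assumptions for the sketch, not the exact training code.

```python
import random
import torch
import torch.nn.functional as F
from torchvision.transforms import v2

def make_views(img: torch.Tensor, clean_gsd: float = 0.3):
    """img: (C, H, W) float tensor in [0, 1], the clean 30 cm tile."""
    teacher_view = img  # teacher: clean, full-resolution input

    # student: random coarser GSD between 30 cm and 2 m
    target_gsd = random.uniform(0.3, 2.0)
    scale = clean_gsd / target_gsd
    student_view = F.interpolate(
        img.unsqueeze(0), scale_factor=scale, mode="bilinear", align_corners=False
    ).squeeze(0)

    # photometric distortions (color jitter, blur) plus additive noise
    aug = v2.Compose([
        v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3),
        v2.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    ])
    student_view = aug(student_view) + 0.02 * torch.randn_like(student_view)
    return teacher_view, student_view
```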
#### 2️⃣ **I-JEPA-Style Patch Dropping**
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
- Random **patch regions are dropped** from the student input.
- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones.
- This enforces **local-global and partial-whole consistency** in the latent space.
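A minimal sketch of block-wise patch dropping on the student input is shown below; the `drop_patches` helper, patch size, and drop ratio are illustrative assumptions rather than the exact masking configuration used in training.

```python
import torch

def drop_patches(x: torch.Tensor, patch: int = 32, drop_ratio: float = 0.3):
    """x: (B, C, H, W), with H and W divisible by `patch`.
    Returns the masked input and a boolean keep-mask per patch."""
    B, C, H, W = x.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, gh, gw, device=x.device) > drop_ratio            # True = visible patch
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)   # (B, H, W)
    x_masked = x * mask.unsqueeze(1)                                      # zero out dropped regions
    return x_masked, keep

# later, only the visible patches contribute to the alignment loss, e.g.:
# loss = alignment_loss(student_tokens[keep.flatten(1)], teacher_tokens[keep.flatten(1)])
```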
#### 3️⃣ **YOLOv11-X as Encoder Backbone**
- We use **YOLOv11-X**, one of the most powerful and recent YOLO variants, as the spatial encoder.
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency.
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning.
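A sketch of how the truncated encoder can be driven through Ultralytics is shown below. The 23-layer cut follows the description above; the routing loop mirrors the library's internal layer-by-layer forward (each module stores the index of its input in `m.f`), and `yolo11x.pt` refers to the standard Ultralytics checkpoint.

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x.pt")
yolo.model.eval()
layers = yolo.model.model[:23]          # first 23 modules, i.e. the detection head is dropped

def encode(x, layers):
    y = []                              # cache of intermediate outputs for skip/concat routing
    for m in layers:
        if m.f != -1:                   # layer takes input from earlier layer(s)
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        y.append(x)
    return x                            # spatial feature map after layer 23

with torch.no_grad():
    feats = encode(torch.zeros(1, 3, 640, 640), layers)
```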
---
## 🧪 Training Flow: Resolution-Agnostic DiNO
The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
#### 👨‍🏫 1. Teacher View (Clean & High-Res)
- Receives a **clean 30 cm image** without any augmentation.
- Used as the stable reference to guide the student.
#### 👨‍🎓 2. Student View (Augmented Multi-Resolution)
- Receives **randomly augmented** versions of the same image:
- Downsampled to **30 cm to 2 m**
- Augmented with noise, blur, color jitter, spatial dropout, etc.
- Emulates resolution variability common in EO imagery.
#### ⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce different spatial dimensions (H × W),
we use **bilinear interpolation** to **resize the student’s feature map** to match the teacher's spatial shape before computing the contrastive loss.
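In code, this alignment is a single `F.interpolate` call; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

# Example shapes: a 2 m student view yields a smaller feature map than the 30 cm teacher view.
teacher_feat = torch.randn(1, 256, 80, 80)   # from the clean 30 cm input (illustrative)
student_feat = torch.randn(1, 256, 12, 12)   # from a 2 m degraded input (illustrative)

student_feat = F.interpolate(
    student_feat, size=teacher_feat.shape[-2:], mode="bilinear", align_corners=False
)
assert student_feat.shape == teacher_feat.shape
```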
#### 🎯 4. Objective
- Align the spatial token embeddings of the student with the teacher — pixel-to-pixel and semantically — despite resolution gaps and augmentations.
- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
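A hedged sketch of a DINO-style alignment loss applied per spatial token on the resized feature maps is shown below; the temperatures and the omission of centering and other details of the full DINO recipe are simplifications, not the exact training objective.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_feat, teacher_feat, t_student=0.1, t_teacher=0.04):
    # (B, C, H, W) -> (B*H*W, C): one token per spatial location
    s = student_feat.permute(0, 2, 3, 1).reshape(-1, student_feat.shape[1])
    t = teacher_feat.permute(0, 2, 3, 1).reshape(-1, teacher_feat.shape[1])

    teacher_probs = F.softmax(t.detach() / t_teacher, dim=-1)   # no gradient through the teacher
    student_logp = F.log_softmax(s / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()   # cross-entropy per token
```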
---
## 📈 Performance: Latent Quality & Downstream Evaluation
Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability — both in visual similarity and downstream tasks.
### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset)
We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.
- **Setup:**
- Both `core-dino` and **YOLOv11-X** backbones were **frozen**
- Only a **2-layer convolutional head** was trained
- Task: Binary road segmentation using IoU loss
- **Result:**
- `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs
- Shows superior latent representation quality, even without task-specific supervision
- Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
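A sketch of the probe described above: a 2-layer convolutional head over frozen backbone features, trained with a soft-IoU loss. Channel widths and the `RoadHead` / `iou_loss` names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RoadHead(nn.Module):
    """2-layer convolutional head on top of frozen backbone features."""
    def __init__(self, in_ch=256, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, feats):            # feats: frozen backbone output (B, C, H, W)
        return self.head(feats)          # (B, 1, H, W) road logits

def iou_loss(logits, target, eps=1e-6):
    """Soft IoU loss for binary road masks; target: (B, 1, H, W) in {0, 1}."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()
```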
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
<span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
<a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>
</p>
<p align="center"><br>
<img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
</p>
### 🏙️ Downstream: Building Footprint Validation
To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.
- **Setup:**
- Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights**
- Used the same training pipeline for both
- **Result:**
- `core-dino` achieved **+15 mAP** improvement over standard YOLOv11-X
- Captures edge-localized and compact building structures better
- Demonstrates strong spatial precision and high-quality feature encoding
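A sketch of the two runs being compared, using the same Ultralytics training pipeline for both initializations; the dataset YAML, epoch count, and image size are placeholders, not the actual training configuration.

```python
import torch
from ultralytics import YOLO

def finetune(init_with_core_dino: bool):
    model = YOLO("yolo11x.pt")
    if init_with_core_dino:
        url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
        ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
        model.model.load_state_dict(
            {k.replace("layers.", "model."): v for k, v in ckpt.items()}, strict=False
        )
    # identical training pipeline for both initializations (settings are placeholders)
    model.train(data="buildings.yaml", epochs=50, imgsz=640)
    return model

baseline = finetune(init_with_core_dino=False)
core_dino_init = finetune(init_with_core_dino=True)
```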
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
<span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
<a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>
</p>
<p align="center"><br>
<img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
</p>
---
## 🗂️ Model Details
| Field | Value |
|--------------------|--------------------------------------------------------------|
| Parameters | **56.7M** |
| Backbone Architecture | **YOLOv11-X** |
| Input Size | **320 × 320 – 4096 × 4096** |
| Patch Source | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
| Resolutions | 30 cm (clean) → 2 m (augmented) |
| Patch Drop | I-JEPA-style masking |
| Loss | DINO contrastive loss |
| Training Time | ~48h on 1×A100 |
---
## 💳 License
This project is released under the **[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)** license.
> ✅ Free to use, share, and adapt for **non-commercial research**
> ❌ **Commercial use is not permitted** without explicit permission
> 📌 Please provide appropriate credit when using this model in publications or projects.