|
--- |
|
license: cc-by-nc-4.0 |
|
--- |
|
|
|
# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery |
|
|
|
[🤗 Model on Hugging Face](https://huggingface.co/gajeshladhar/core-dino)
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
|
|
|
|
--- |
|
|
|
## 🔭 Overview |
|
|
|
`core-dino` is a resolution-agnostic **self-supervised model** for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) with a DINO-inspired student-teacher setup. It handles imagery from **20 cm to 2 m** ground resolution, learning strong spatial features without any labels.
|
|
|
<p> |
|
<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank"> |
|
<b>Open Demo ▶️</b> |
|
</a> - Run multi-resolution inference & visualize spatial embeddings.</p> |
|
|
|
--- |
|
|
|
## 🌀 Quickstart |
|
|
|
```python
import torch
from ultralytics import YOLO

# any YOLO11 task variant works here (OBB, detection, or segmentation)
model = YOLO("yolo11x-obb.pt")

# download the self-supervised core-dino student checkpoint
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")

# remap checkpoint keys ('layers.*' -> 'model.*') and load into the backbone;
# strict=False leaves the task-specific head at its original initialization
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)
```
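
Because the checkpoint covers only the backbone, loading with `strict=False` swaps in the pretrained encoder weights while the detection head keeps its original initialization, ready for downstream fine-tuning.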
|
|
|
--- |
|
|
|
## 🧠 Architecture: DiNO × YOLO × I-JEPA |
|
|
|
We combine three ideas to build a high-performance backbone for spatial representation learning: |
|
|
|
#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)** |
|
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views. |
|
> In `core-dino`, we replace this with **clean vs degraded resolution contrast**: |
|
- 👨🏫 **Teacher** gets clean 30 cm satellite imagery. |
|
- 👨🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions. |
|
|
|
This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps. |
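
As an illustration, here is a minimal sketch of how such a clean-vs-degraded pair could be produced; the helper name `make_student_view` and the specific augmentation choices are assumptions for illustration, not code from this repository:

```python
import torch
import torch.nn.functional as F

def make_student_view(img: torch.Tensor, base_gsd: float = 0.3,
                      max_gsd: float = 2.0) -> torch.Tensor:
    """Degrade a clean 30 cm patch to a random coarser resolution (illustrative)."""
    gsd = torch.empty(1).uniform_(base_gsd, max_gsd).item()  # target GSD in metres
    scale = base_gsd / gsd                                   # e.g. 0.3 / 2.0 = 0.15
    low = F.interpolate(img, scale_factor=scale,
                        mode="bilinear", align_corners=False)
    # simple photometric distortion: brightness jitter + additive Gaussian noise
    low = low * torch.empty(1).uniform_(0.8, 1.2).item() + 0.02 * torch.randn_like(low)
    return low.clamp(0.0, 1.0)

teacher_view = torch.rand(1, 3, 512, 512)       # clean 30 cm image for the teacher
student_view = make_student_view(teacher_view)  # degraded copy for the student
```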
|
|
|
#### 2️⃣ **I-JEPA-Style Patch Dropping** |
|
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243): |
|
- Random **patch regions are dropped** from the student input. |
|
- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones. |
|
- This enforces **local-global and partial-whole consistency** in the latent space. |
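
A minimal sketch of this masking step, assuming zeroed patch-aligned regions (the masking granularity and drop ratio are illustrative, not taken from the repo):

```python
import torch

def drop_patches(x: torch.Tensor, patch: int = 32,
                 drop_ratio: float = 0.3) -> torch.Tensor:
    """Randomly zero patch-aligned regions of the student input (illustrative).

    Assumes H and W are multiples of `patch`.
    """
    b, _, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=x.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask  # teacher still sees the full, unmasked image
```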
|
|
|
#### 3️⃣ **YOLOv11-X as Encoder Backbone** |
|
- We use **YOLOv11-X**, the largest model in the recent YOLOv11 family, as the spatial encoder.
|
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency. |
|
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning. |
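
A sketch of how such a truncation can be done with ultralytics; the routing loop mirrors how YOLO models forward through skip connections (`m.f` holds each layer's input indices in current ultralytics releases), but the exact cut point and usage here are assumptions:

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt")
encoder = yolo.model.model[:23]  # keep only the first 23 modules as the encoder

def encode(x: torch.Tensor) -> torch.Tensor:
    """Run the truncated backbone, honouring YOLO's internal skip routing."""
    cache = []
    for m in encoder:
        if m.f != -1:  # layer consumes output(s) of earlier layers (e.g. Concat)
            x = cache[m.f] if isinstance(m.f, int) else \
                [x if j == -1 else cache[j] for j in m.f]
        x = m(x)
        cache.append(x)
    return x  # spatial feature map used for self-supervised alignment
```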
|
|
|
|
|
--- |
|
|
|
## 🧪 Training Flow: Resolution-Agnostic DiNO |
|
|
|
The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery: |
|
|
|
#### 👨🏫 1. Teacher View (Clean & High-Res) |
|
- Receives a **clean 30 cm image** without any augmentation. |
|
- Used as the stable reference to guide the student. |
|
|
|
#### 👨🎓 2. Student View (Augmented Multi-Resolution) |
|
- Receives **randomly augmented** versions of the same image: |
|
  - Downsampled to resolutions between **30 cm and 2 m**
|
- Augmented with noise, blur, color jitter, spatial dropout, etc. |
|
- Emulates resolution variability common in EO imagery. |
|
|
|
#### ⚠️ 3. Spatial Misalignment & Solution |
|
- Since different student resolutions produce different spatial dimensions (H × W), |
|
we use **bilinear interpolation** to **resize the student’s feature map** to match the teacher's spatial shape before computing the contrastive loss. |
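
In code, this resizing step is a single interpolation call (tensor shapes below are placeholders, not the model's actual feature dimensions):

```python
import torch
import torch.nn.functional as F

student_feat = torch.rand(1, 256, 40, 40)    # placeholder low-res student features
teacher_feat = torch.rand(1, 256, 160, 160)  # placeholder teacher features

# upsample the student map to the teacher's spatial grid before the loss
student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
assert student_feat.shape == teacher_feat.shape
```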
|
|
|
#### 🎯 4. Objective |
|
- Align the spatial token embeddings of the student with the teacher — pixel-to-pixel and semantically — despite resolution gaps and augmentations. |
|
- Encourages **scale-invariant**, **robust** feature learning across real-world variations. |
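
The exact loss implementation is not spelled out here, but a dense, per-pixel adaptation of the DINO cross-entropy objective might look like the following sketch; the temperatures and teacher centering follow the DINO paper, while applying them at every spatial location is our assumption:

```python
import torch
import torch.nn.functional as F

def dense_dino_loss(student: torch.Tensor, teacher: torch.Tensor,
                    center: torch.Tensor, tau_s: float = 0.1,
                    tau_t: float = 0.04) -> torch.Tensor:
    """DINO-style cross-entropy between student and teacher at every pixel."""
    t = F.softmax((teacher.detach() - center) / tau_t, dim=1)  # sharpened, no grad
    log_s = F.log_softmax(student / tau_s, dim=1)
    return -(t * log_s).sum(dim=1).mean()

feat_s = torch.rand(1, 256, 160, 160, requires_grad=True)  # resized student map
feat_t = torch.rand(1, 256, 160, 160)                      # teacher map
center = feat_t.mean(dim=(0, 2, 3), keepdim=True)  # a running mean in practice
loss = dense_dino_loss(feat_s, feat_t, center)
```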
|
|
|
|
|
|
|
--- |
|
|
|
## 📈 Performance: Latent Quality & Downstream Evaluation |
|
|
|
Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability — both in visual similarity and downstream tasks. |
|
|
|
### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset) |
|
|
|
We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline. |
|
|
|
- **Setup:** |
|
- Both `core-dino` and **YOLOv11-X** backbones were **frozen** |
|
- Only a **2-layer convolutional head** was trained |
|
- Task: Binary road segmentation using IoU loss |
|
|
|
- **Result:** |
|
- `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs |
|
- Shows superior latent representation quality, even without task-specific supervision |
|
- Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks |
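
A minimal sketch of this frozen-backbone probe; the stand-in encoder, the 256-channel width, and the soft-IoU formulation are illustrative assumptions, not the exact evaluation code:

```python
import torch
import torch.nn as nn

# freeze the pretrained encoder; only the small head is trained
encoder = nn.Conv2d(3, 256, 3, padding=1)  # stand-in for the frozen core-dino backbone
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Sequential(                       # the 2-layer convolutional head
    nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1),
)

def iou_loss(logits: torch.Tensor, target: torch.Tensor,
             eps: float = 1e-6) -> torch.Tensor:
    """Soft IoU loss for binary road masks."""
    prob = logits.sigmoid()
    inter = (prob * target).sum(dim=(2, 3))
    union = (prob + target - prob * target).sum(dim=(2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()

x = torch.rand(2, 3, 256, 256)
y = torch.randint(0, 2, (2, 1, 256, 256)).float()
loss = iou_loss(head(encoder(x)), y)  # gradients flow into the head only
```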
|
|
|
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;"> |
|
<span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span> |
|
<a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> |
|
</a> |
|
</p> |
|
<p align="center"><br> |
|
<img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;"> |
|
</p> |
|
|
|
|
|
### 🏙️ Downstream: Building Footprint Validation
|
|
|
To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery. |
|
|
|
- **Setup:** |
|
- Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights** |
|
  - Used the same training pipeline for both
|
|
|
- **Result:** |
|
- `core-dino` achieved **+15 mAP** improvement over standard YOLOv11-X |
|
- Captures edge-localized and compact building structures better |
|
- Demonstrates strong spatial precision and high-quality feature encoding |
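
The comparison boils down to initializing the same detector two ways before an identical training run; a sketch using the ultralytics API (`buildings.yaml` is a placeholder dataset config, not shipped with this repo):

```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # baseline: original YOLOv11-X weights

# variant: re-initialize the backbone with core-dino weights (as in the Quickstart)
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()}, strict=False)

# identical training pipeline for both variants
model.train(data="buildings.yaml", epochs=50, imgsz=1024)
```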
|
|
|
<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;"> |
|
<span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span> |
|
<a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> |
|
</a> |
|
</p> |
|
<p align="center"><br> |
|
<img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;"> |
|
</p> |
|
|
|
|
|
--- |
|
|
|
## 🗂️ Model Details |
|
|
|
| Field | Value | |
|
|--------------------|--------------------------------------------------------------| |
|
| Parameters | **56.7M** | |
|
| Backbone Architecture | **YOLOv11-X** |
|
| Input Size | **320 × 320 – 4096 × 4096** | |
|
| Patch Source | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) | |
|
| Resolutions | 30 cm (clean) → 2 m (augmented) | |
|
| Patch Drop | I-JEPA-style masking | |
|
| Loss | DINO contrastive loss | |
|
| Training Time | ~48h on 1×A100 | |
|
|
|
|
|
--- |
|
## 💳 License |
|
|
|
This project is released under the **[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)** license.
|
|
|
> ✅ Free to use, share, and adapt for **non-commercial research** |
|
> ❌ **Commercial use is not permitted** without explicit permission |
|
> 📌 Please provide appropriate credit when using this model in publications or projects.
|
|
|
|