---
license: cc-by-nc-4.0
---

# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery

[![πŸ€— Model Hub](https://img.shields.io/badge/HuggingFace-core--dino-blue?logo=huggingface&logoColor=white)](https://huggingface.co/gajeshladhar/core-dino)
[![πŸš€ Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
![πŸ›°οΈ Domain](https://img.shields.io/badge/Domain-Earth%20Observation-green)
![πŸ” Task](https://img.shields.io/badge/Task-Self--Supervised--Learning-orange)

---

## πŸ”­ Overview

`core-dino` is a resolution-agnostic **self-supervised model** designed for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) using a DiNO-inspired setup. It handles imagery between **20β€―cm and 2β€―m**, learning strong spatial features without any labels.

<p>
<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
  <b>Open Demo ▢️</b>
</a> - Run multi-resolution inference & visualize spatial embeddings.</p>

---

## πŸŒ€ Quickstart

```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # any YOLOv11 task head works (obb, detect/bbox, or seg)
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
# checkpoint keys are prefixed with 'layers.'; remap them to the YOLO backbone's 'model.' prefix
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)
```
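
Because the checkpoint covers only the backbone, `strict=False` leaves the OBB head with its original `yolo11x-obb` weights. A quick hedged sanity check might look like this (the image path is a placeholder):

```python
# Usage sketch: the backbone now carries core-dino features, but the detection head
# is unchanged, so fine-tune on your task before relying on its predictions.
results = model.predict("scene.jpg", imgsz=1024)  # "scene.jpg" is a placeholder path
print(results[0].obb)                             # oriented-box outputs from the yolo11x-obb head
```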

---

## 🧠 Architecture: DiNO Γ— YOLO Γ— I-JEPA

We combine three ideas to build a high-performance backbone for spatial representation learning:

#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)**
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.  
> In `core-dino`, we replace this with **clean vs degraded resolution contrast**:
- πŸ‘¨β€πŸ« **Teacher** gets clean 30β€―cm satellite imagery.
- πŸ‘¨β€πŸŽ“ **Student** sees augmented versions of the same scene at varying resolutions (30β€―cm β†’ 2β€―m) with photometric and spatial distortions.

This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.

#### 2️⃣ **I-JEPA-Style Patch Dropping**
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
- Random **patch regions are dropped** from the student input.
- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones.
- This enforces **local-global and partial-whole consistency** in the latent space (a minimal masking sketch follows below).
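
A minimal sketch of such patch dropping on the student input; the patch size and drop ratio here are illustrative, not the values used to train `core-dino`:

```python
import torch

def drop_patches(x, patch=32, drop_ratio=0.3):
    """Zero out a random subset of non-overlapping patches in a (B, C, H, W) batch."""
    b, c, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=x.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, mask  # mask marks the patches that stay visible to the student

student_in, visible = drop_patches(torch.randn(2, 3, 320, 320))
```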

#### 3️⃣ **YOLOv11-X as Encoder Backbone**
- We use **YOLOv11-X**, the largest of the recent YOLOv11 variants, as the spatial encoder.
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while keeping the encoder efficient.
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning (see the inspection sketch below).
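
A small inspection sketch, assuming the standard `ultralytics` layout where the underlying network is an indexable `nn.Sequential`; the 23-module slice is shown only to illustrate the truncation point (YOLO blocks also consume skip connections, so the slice is not a drop-in standalone feature extractor):

```python
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt")
layers = yolo.model.model                 # underlying nn.Sequential of YOLOv11-X modules
print(len(layers))                        # total module count, detection head included
encoder = layers[:23]                     # the first 23 modules retained by core-dino
print(sum(p.numel() for p in encoder.parameters()))  # parameters in the truncated encoder
```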


---

## πŸ§ͺ Training Flow: Resolution-Agnostic DiNO

The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:

#### πŸ‘¨β€πŸ«  1. Teacher View (Clean & High-Res)
- Receives a **clean 30β€―cm image** without any augmentation.
- Used as the stable reference to guide the student.

#### πŸ‘¨β€πŸŽ“ 2. Student View (Augmented Multi-Resolution)
- Receives **randomly augmented** versions of the same image:
  - Downsampled to **30β€―cm to 2β€―m**
  - Augmented with noise, blur, color jitter, spatial dropout, etc.
- Emulates the resolution variability common in EO imagery (see the augmentation sketch below).
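
A minimal sketch of how such a degraded student view can be produced from the clean teacher tile; the 30 cm to 2 m range comes from the text above, while the specific augmentations and their strengths are illustrative:

```python
import random
import torch
import torch.nn.functional as F

def make_student_view(clean, gsd_clean=0.3, gsd_max=2.0):
    """Degrade a clean (B, C, H, W) tile to a random coarser GSD, then lightly augment it."""
    gsd = random.uniform(gsd_clean, gsd_max)      # target ground-sample distance in metres
    scale = gsd_clean / gsd                       # e.g. 0.3 m -> 2.0 m gives a 0.15x rescale
    low = F.interpolate(clean, scale_factor=scale, mode="bilinear", align_corners=False)
    low = low + 0.02 * torch.randn_like(low)      # sensor-like noise (illustrative strength)
    low = low * random.uniform(0.9, 1.1)          # simple photometric jitter
    return low, gsd

view, gsd = make_student_view(torch.rand(1, 3, 512, 512))
```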

#### ⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce feature maps of different spatial dimensions (H × W),  
  we use **bilinear interpolation** to **resize the student's feature map** to match the teacher's spatial shape before computing the contrastive loss (see the snippet below).
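
In code, this alignment step boils down to a single interpolation call (a sketch with made-up tensor shapes):

```python
import torch
import torch.nn.functional as F

teacher_feats = torch.randn(1, 256, 40, 40)   # illustrative (B, C, H, W) map from the clean view
student_feats = torch.randn(1, 256, 12, 12)   # coarser map from a low-resolution student view

student_aligned = F.interpolate(
    student_feats, size=teacher_feats.shape[-2:], mode="bilinear", align_corners=False
)  # now (1, 256, 40, 40), spatially comparable with the teacher map
```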

#### 🎯 4. Objective
- Align the student's spatial token embeddings with the teacher's, pixel to pixel and semantically, despite resolution gaps and augmentations (a minimal loss sketch follows below).
- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
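
A hedged per-pixel version of such an alignment loss, assuming the feature maps were already resized as in step 3 and using DINO-typical temperature values, might look like:

```python
import torch.nn.functional as F

def spatial_dino_loss(student_aligned, teacher_feats, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student channel distributions at every pixel."""
    s = F.log_softmax(student_aligned / tau_s, dim=1)     # student distribution over channels
    t = F.softmax(teacher_feats.detach() / tau_t, dim=1)  # teacher acts as a stop-gradient target
    return -(t * s).sum(dim=1).mean()                     # averaged over batch and pixels
```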



---

## πŸ“ˆ Performance: Latent Quality & Downstream Evaluation

Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability β€” both in visual similarity and downstream tasks.

### πŸ›£οΈ Downstream: Road Extraction (DeepGlobe Dataset)

We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.

- **Setup:**  
  - Both the `core-dino` and **YOLOv11-X** backbones were **frozen**  
  - Only a **2-layer convolutional head** was trained  
  - Task: binary road segmentation trained with an IoU loss (a minimal head sketch follows the results below)

- **Result:**  
  - `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs  
  - Shows superior latent representation quality, even without task-specific supervision  
  - Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
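
A minimal sketch of the evaluation head described above; the input channel count, hidden width, and exact IoU formulation are assumptions rather than the exact training code:

```python
import torch
import torch.nn as nn

class RoadHead(nn.Module):
    """2-layer convolutional head trained on top of a frozen backbone's feature map."""
    def __init__(self, in_channels=256, hidden=128):   # channel counts are illustrative
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),                    # single-channel road logits
        )

    def forward(self, feats):
        return self.head(feats)

def iou_loss(logits, target, eps=1e-6):
    """Soft-IoU (Jaccard) loss for binary road masks."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()
```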

<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
  <span style="font-size: 16px;">πŸ““ <strong>Reproduce this comparison in Colab:</strong></span>
  <a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
  </a>
</p>
<p align="center"><br>
  <img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
</p>


### πŸ™οΈ Downstream : Building Footprint Validation

To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.

- **Setup:**  
  - Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights**  
  - Used the same training pipeline for both (a sketch follows the results below)

- **Result:**  
  - `core-dino` achieved **+15 mAP** improvement over standard YOLOv11-X  
  - Captures edge-localized and compact building structures better  
  - Demonstrates strong spatial precision and high-quality feature encoding
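
A hedged sketch of that comparison, reusing the Quickstart weight loading; the dataset YAML and hyper-parameters below are placeholders, not the actual training configuration:

```python
from ultralytics import YOLO

def train_footprints(init_with_core_dino: bool):
    model = YOLO("yolo11x-obb.pt")
    if init_with_core_dino:
        # ...load the core-dino student checkpoint into model.model as shown in the Quickstart...
        pass
    # identical recipe for both runs; "buildings.yaml" is a placeholder dataset config
    return model.train(data="buildings.yaml", imgsz=1024, epochs=100)

baseline = train_footprints(init_with_core_dino=False)
with_core_dino = train_footprints(init_with_core_dino=True)
```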

<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
  <span style="font-size: 16px;">πŸ““ <strong>Reproduce this comparison in Colab:</strong></span>
  <a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
  </a>
</p>
<p align="center"><br>
  <img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
</p>


---

## πŸ—‚οΈ Model Details

| Field              | Value                                                        |
|--------------------|--------------------------------------------------------------|
| Parameters         | **56.7M**                                                    |
| Backbone Architecture | **YOLOv11-X**                                                |
| Input Size         | **320 Γ— 320 – 4096 Γ— 4096**                                  |
| Patch Source       | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
| Resolutions        | 30β€―cm (clean) β†’ 2β€―m (augmented)                              |
| Patch Drop         | I-JEPA-style masking                                         |
| Loss               | DINO contrastive loss                                        |
| Training Time      | ~48h on 1Γ—A100                                               |


---
## πŸ’³ License  

This project is released under the **[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)** license.

> βœ… Free to use, share, and adapt for **non-commercial research**  
> ❌ **Commercial use is not permitted** without explicit permission  
> 📌 Please provide appropriate credit when using this model in publications or projects.