Paper Trial - 2025
2025 Weekly Paper Reading Challenge: a journey of reading and sharing insightful research papers throughout 2025.
Paper • 2412.06781 • Published • 21
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper • 2501.08313 • Published • 283
Do computer vision foundation models learn the low-level characteristics of the human visual system?
Paper • 2502.20256 • Published
Note: Investigates whether computer vision foundation models (e.g., DINO [9], OpenCLIP [22, 33]) exhibit similarities to the human visual system (HVS) on low-level perceptual tasks such as contrast detection, masking, and constancy. The partial success of OpenCLIP and DINOv2 suggests that language-supervised and self-supervised learning may implicitly model some HVS features.
High-Resolution Building and Road Detection from Sentinel-2
Paper • 2310.11622 • Published
Note: Bridges the resolution gap:
- Inputs: low-resolution Sentinel-2 imagery (10 m).
- Outputs: high-resolution (50 cm) building and road segmentation, counting, and height estimation.
Uses an effective teacher-student paradigm: a teacher model is trained on high-resolution data, and the student model is trained on pseudo-labels produced by the teacher.
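The teacher-student idea above can be sketched as a single training step. This is a minimal illustration, not the paper's implementation: the function name, the confidence threshold, and the binary-mask formulation are assumptions; the paper additionally handles counting and height estimation.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(teacher, student, hi_res, lo_res, threshold=0.9):
    """One hypothetical distillation step: the teacher labels high-res
    imagery; the student learns to reproduce those labels from the
    corresponding low-res (e.g., Sentinel-2) input."""
    with torch.no_grad():
        pseudo = teacher(hi_res).sigmoid()        # soft building/road masks
    # Only train on pixels where the teacher is confident either way.
    confident = (pseudo > threshold) | (pseudo < 1 - threshold)
    logits = student(lo_res)
    loss = F.binary_cross_entropy_with_logits(
        logits[confident], (pseudo > 0.5).float()[confident]
    )
    return loss
```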
GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization
Paper • 2309.16020 • Published
Note: Reframes geo-localization as an image-to-GPS retrieval problem:
- Image Encoder: a CLIP-based Vision Transformer (ViT).
- Location Encoder: a multi-scale design with Random Fourier Features at various frequencies, effectively capturing both coarse- and fine-grained location information.
- Contrastive learning objective aligning image and location embeddings.
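The multi-scale Random Fourier Feature encoding can be sketched as follows. This is an illustrative implementation, not GeoCLIP's code: the function name, feature count, and sigma values are assumptions; each sigma sets a frequency scale, so low frequencies capture coarse location and high frequencies capture fine-grained position.

```python
import numpy as np

def rff_location_features(lat_lon, num_features=64, sigmas=(1.0, 16.0, 256.0), seed=0):
    """Encode (lat, lon) pairs with Random Fourier Features at several
    frequency scales, concatenating the per-scale encodings."""
    rng = np.random.default_rng(seed)
    x = np.asarray(lat_lon, dtype=np.float64)     # shape (N, 2)
    feats = []
    for sigma in sigmas:
        W = rng.normal(scale=sigma, size=(2, num_features))  # random frequencies
        proj = x @ W
        feats.append(np.concatenate([np.cos(proj), np.sin(proj)], axis=-1))
    return np.concatenate(feats, axis=-1)         # (N, 2 * num_features * len(sigmas))
```

The cos/sin pair per frequency makes the encoding shift-equivariant in phase, and stacking scales lets a downstream MLP read off both continent-level and street-level structure.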
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
Paper • 2312.06960 • Published
Note:
- Genuinely unsupervised vision-language alignment without text annotations, using ground-level imagery as an intermediary.
- Two encoders, one image-level and one pixel-level, trained with an image-to-image global contrastive loss and a pixel-to-image contrastive loss (sparse pixel alignment).
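The global image-to-image contrastive loss can be sketched as a standard symmetric InfoNCE objective. This is an assumed form for illustration, not the paper's exact loss: matching ground/remote pairs in a batch are pulled together and all other pairings pushed apart.

```python
import torch
import torch.nn.functional as F

def info_nce(ground_emb, remote_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired ground and remote
    embeddings; row i of each tensor is assumed to be a matching pair."""
    g = F.normalize(ground_emb, dim=-1)
    r = F.normalize(remote_emb, dim=-1)
    logits = g @ r.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(g.size(0))             # diagonal entries are positives
    # Average the ground->remote and remote->ground retrieval losses.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

The pixel-to-image variant replaces the image-level ground embedding with embeddings at the sparse pixels where ground photos are geo-tagged, but keeps the same contrastive structure.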
ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
Paper • 2502.04320 • Published • 36