From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published Oct 2025 • 65
GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning Paper • 2410.15266 • Published Oct 20, 2024
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Paper • 2502.06788 • Published Feb 10, 2025 • 13
Autoregressive Video Generation without Vector Quantization Paper • 2412.14169 • Published Dec 18, 2024 • 14
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception Paper • 2407.08303 • Published Jul 11, 2024 • 19
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning Paper • 2407.07523 • Published Jul 10, 2024 • 6
Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching Paper • 2404.18114 • Published Apr 28, 2024
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory Paper • 2308.14316 • Published Aug 28, 2023
Similarity Reasoning and Filtration for Image-Text Matching Paper • 2101.01368 • Published Jan 5, 2021