Article SmolLM3: smol, multilingual, long-context reasoner By loubnabnl and 22 others • 2 days ago • 412
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks Paper • 2507.01955 • Published 8 days ago • 29
V-JEPA 2 Collection A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated 27 days ago • 144
ERNIE 4.5 Collection Collection of ERNIE 4.5 models. "-Paddle" models use PaddlePaddle weights, while "-PT" models use Transformer-style PyTorch weights. • 23 items • Updated 7 days ago • 146
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 29
Article Gemma 3n fully available in the open-source ecosystem! By ariG23498 and 7 others • 14 days ago • 105
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models Paper • 2403.13372 • Published Mar 20, 2024 • 107
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models Paper • 2505.17015 • Published May 22 • 9
Article nanoVLM: The simplest repository to train your VLM in pure PyTorch By ariG23498 and 6 others • May 21 • 185
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14 • 94
Qwen2.5-Omni Collection End-to-end omni (text, audio, image, video, and natural speech interaction) model based on Qwen2.5 • 7 items • Updated May 21 • 148
Qwen2-VL Collection Vision-language model series based on Qwen2 • 16 items • Updated Apr 28 • 220
Article Vision Language Models (Better, Faster, Stronger) By merve and 4 others • May 12 • 474