xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations Paper • 2408.12590 • Published Aug 22, 2024 • 37
xLAM: A Family of Large Action Models to Empower AI Agent Systems Paper • 2409.03215 • Published Sep 5, 2024 • 5
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant Paper • 2403.11299 • Published Mar 17, 2024 • 1
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer Paper • 2212.09877 • Published Dec 19, 2022
Trust but Verify: Programmatic VLM Evaluation in the Wild Paper • 2410.13121 • Published Oct 17, 2024 • 3
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs Paper • 2410.16267 • Published Oct 21, 2024 • 18
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 24
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models Paper • 2412.07012 • Published Dec 9, 2024 • 1
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs Paper • 2504.17040 • Published Apr 23, 2025 • 13
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14, 2025 • 94