xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations Paper • 2408.12590 • Published Aug 22, 2024 • 37
xLAM: A Family of Large Action Models to Empower AI Agent Systems Paper • 2409.03215 • Published Sep 5, 2024 • 5
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant Paper • 2403.11299 • Published Mar 17, 2024 • 1
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer Paper • 2212.09877 • Published Dec 19, 2022
Trust but Verify: Programmatic VLM Evaluation in the Wild Paper • 2410.13121 • Published Oct 17, 2024 • 3
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs Paper • 2410.16267 • Published Oct 21, 2024 • 18
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 24
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models Paper • 2412.07012 • Published Dec 9, 2024 • 1
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs Paper • 2504.17040 • Published Apr 23, 2025 • 13
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14, 2025 • 94
Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation Paper • 2303.04991 • Published Mar 9, 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning Paper • 2311.18799 • Published Nov 30, 2023 • 1
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization Paper • 2308.02151 • Published Aug 4, 2023 • 19
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents Paper • 2308.05960 • Published Aug 11, 2023 • 19
FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability Paper • 2402.18667 • Published Feb 28, 2024
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding Paper • 2212.05171 • Published Dec 10, 2022
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17, 2024 • 21
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases Paper • 2406.10290 • Published Jun 12, 2024