Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage Paper • 2412.15606 • Published Dec 20, 2024 • 2
LongViTU: Instruction Tuning for Long-Form Video Understanding Paper • 2501.05037 • Published Jan 9, 2025 • 1
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World Paper • 2310.10207 • Published Oct 16, 2023
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding Paper • 2401.09340 • Published Jan 17, 2024 • 22
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment Paper • 2308.04352 • Published Aug 8, 2023
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Paper • 2403.11481 • Published Mar 18, 2024 • 13
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting Paper • 2403.15624 • Published Mar 22, 2024
Neural-Symbolic Recursive Machine for Systematic Generalization Paper • 2210.01603 • Published Oct 4, 2022
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation Paper • 2211.15402 • Published Nov 28, 2022
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents Paper • 2407.00114 • Published Jun 27, 2024 • 13
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale Paper • 2407.05282 • Published Jul 7, 2024 • 15
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models Paper • 2407.11522 • Published Jul 16, 2024 • 9