- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 147
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 13
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 54
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 46
Collections
Collections including paper arxiv:2405.14129
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 46
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 13
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
  Paper • 2405.09215 • Published • 20
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
  Paper • 2405.14129 • Published • 12

- Vript: A Video Is Worth Thousands of Words
  Paper • 2406.06040 • Published • 26
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 73
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  Paper • 2406.01574 • Published • 45
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
  Paper • 2405.21075 • Published • 22

- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
  Paper • 2405.14129 • Published • 12
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 130
- VITA: Towards Open-Source Interactive Omni Multimodal LLM
  Paper • 2408.05211 • Published • 47

- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
  Paper • 2403.12596 • Published • 10
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  Paper • 2404.13013 • Published • 31
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
  Paper • 2404.16994 • Published • 36
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
  Paper • 2405.14129 • Published • 12

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 41
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22