VDT: General-purpose Video Diffusion Transformers via Mask Modeling Paper • 2305.13311 • Published May 22, 2023
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training Paper • 2103.06561 • Published Mar 11, 2021
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism Paper • 2401.02954 • Published Jan 5, 2024 • 49
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8, 2024 • 47
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling Paper • 2302.06605 • Published Feb 13, 2023
Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs Paper • 2406.09367 • Published Jun 13, 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining Paper • 2410.16166 • Published Oct 21, 2024
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22 • 118
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization Paper • 2503.10615 • Published Mar 13 • 17
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Paper • 2505.23359 • Published 14 days ago • 39
Kimi-VL-A3B Collection Moonshot's efficient MoE VLMs, exceptional on agent, long-context, and thinking • 6 items • Updated Apr 12 • 65
Chat with Kimi-VL-A3B-Thinking 🤔 Space • Running on Zero • Chat with Kimi-VL-A3B-Thinking using text and images • 103