Submitted by Hennara 96 Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR · 7 authors 2
Submitted by taesiri 39 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe · 34 authors 22k 3
Submitted by Silin-Chen 30 SWE-QA: Can Language Models Answer Repository-level Code Questions? · 6 authors 17 2
Submitted by Two-hot 25 How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective · 18 authors 10 2
Submitted by taesiri 20 Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation · 7 authors 2
Submitted by lhmd 19 VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction · 10 authors 59 4
Submitted by taesiri 16 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation · 13 authors 225 4
Submitted by Yunzhen 14 What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT · 5 authors 2
Submitted by ZipW 6 HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis · 2 authors 31 2
Submitted by MinhDucBui 6 Large Language Models Discriminate Against Speakers of German Dialects · 5 authors 2
Submitted by ultra7chen 3 CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching · 10 authors 2
Submitted by emilia-wisnios 3 OpenGVL - Benchmarking Visual Temporal Progress for Data Curation · 6 authors 2
Submitted by Fictionary 2 GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction · 7 authors 35 2
Submitted by spapi 2 Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation · 4 authors 2
Submitted by taesiri 1 Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications · 7 authors 2
Submitted by conan1024hao 1 VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction · 14 authors 1 2
Submitted by abhilekhborah - DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture · 9 authors 2
Submitted by jesbu1 - PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies · 9 authors 1 2