VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Abstract
Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multilingual coverage, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) Broad domain coverage, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary large video models. From the experimental results, we observe that: 1) Existing models perform worse on Chinese-centric questions than Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance on general scientific questions, while open-source models perform weakly on mathematics.
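The per-category scores reported above (e.g. 45.2% on Event Localization) come from aggregating correctness over the benchmark's QA pairs. A minimal sketch of that aggregation is shown below; the record structure and field names (`task`, `correct`) are illustrative assumptions, not the paper's actual data schema.

```python
from collections import defaultdict

def accuracy_by_task(records):
    """Aggregate per-task accuracy (%) from a list of QA results.

    Each record is assumed to carry the task category it belongs to
    and whether the model's answer matched the reference answer.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        hits[r["task"]] += r["correct"]  # bool counts as 0/1
    return {t: 100.0 * hits[t] / totals[t] for t in totals}

# Hypothetical example records (not real benchmark data):
results = [
    {"task": "Event Localization", "correct": False},
    {"task": "Event Localization", "correct": True},
    {"task": "Chinese History", "correct": False},
    {"task": "Chinese History", "correct": True},
]
print(accuracy_by_task(results))
```

Reporting accuracy per category rather than a single overall score is what surfaces the gaps the abstract highlights, such as the disparity between Chinese-centric and Western-centric questions.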
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction (2025)
- IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs (2025)
- InstructionBench: An Instructional Video Understanding Benchmark (2025)
- Generative Frame Sampler for Long Video Understanding (2025)
- ACVUBench: Audio-Centric Video Understanding Benchmark (2025)
- MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX (2025)
- Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark (2025)