GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Abstract
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While prior work has shown that they fall short in usage scenarios involving non-Western contexts, existing studies are limited in scope, covering only a narrow range of cultures, focusing on a small number of cultural aspects, or evaluating a limited selection of models on a single task. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets, on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks, strong correlations between model size and performance, and the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible cultural aspects (e.g., food vs. rituals) and that they excel at recognizing broad cultural origins but struggle with more nuanced understanding.
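The regional-bias analysis described in the abstract amounts to comparing a model's per-example scores grouped by macro-region. The sketch below shows one way such a per-region breakdown could be computed; it is a minimal illustration, not the paper's evaluation code, and the record fields ("region", "correct") as well as the toy data are assumptions.

```python
# Hypothetical sketch (not from the paper): aggregate per-example accuracy by
# macro-region to surface regional bias, as described in the abstract's analysis.
from collections import defaultdict

def regional_accuracy(results):
    """Compute accuracy per macro-region from a list of per-example results.

    Each result is assumed to be a dict with a "region" label (one of the six
    macro-regions) and a boolean "correct" flag for the model's answer.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["region"]] += 1
        hits[r["region"]] += int(r["correct"])
    return {region: hits[region] / totals[region] for region in totals}

# Toy usage with made-up numbers, only to show the shape of the analysis.
toy_results = [
    {"region": "Europe", "correct": True},
    {"region": "Europe", "correct": True},
    {"region": "Africa", "correct": False},
    {"region": "Africa", "correct": True},
]
print(regional_accuracy(toy_results))  # {'Europe': 1.0, 'Africa': 0.5}
```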
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries (2025)
- Exploring Vision Language Models for Multimodal and Multilingual Stance Detection (2025)
- FaceXBench: Evaluating Multimodal LLMs on Face Understanding (2025)
- GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models (2024)
- Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning (2025)
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment (2024)
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data (2025)