Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy
Abstract
Kvasir-VQA-x1 is an expanded dataset for gastrointestinal endoscopy that adds large-scale, complexity-stratified question-answer pairs and visual augmentations, addressing the limited clinical complexity and visual diversity of existing benchmarks to make multimodal AI systems more reliable in clinical settings.
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: https://github.com/Simula/Kvasir-VQA-x1 and https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1
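As a quick orientation, the sketch below shows one way to pull the dataset from the Hugging Face Hub and apply a simple artifact-style perturbation of the kind the robustness track targets. The repository ID is taken from the dataset link above; the split name, column names, and the blur perturbation are illustrative assumptions, not the paper's actual schema or augmentation pipeline, so check the dataset card before relying on them.

```python
# Minimal sketch: load Kvasir-VQA-x1 from the Hugging Face Hub and apply an
# illustrative artifact-style perturbation. The repository ID comes from the
# dataset link; the split name and column names ("image", "question",
# "answer") are assumptions -- consult the dataset card for the real schema.
from datasets import load_dataset
from PIL import ImageFilter

ds = load_dataset("SimulaMet/Kvasir-VQA-x1", split="train")

sample = ds[0]
print(sample.keys())  # inspect the actual column names

# Illustrative perturbation only: the paper's robustness track defines its own
# augmentation pipeline; a Gaussian blur merely mimics one common imaging
# artifact (defocus) for a quick qualitative check, assuming the image column
# decodes to a PIL image.
if "image" in sample:
    blurred = sample["image"].filter(ImageFilter.GaussianBlur(radius=2))
    blurred.save("blurred_example.png")
```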