Improve model card with metadata and paper link
This PR improves the model card by:
- Adding metadata including `library_name`, `pipeline_tag`, and `license`.
- Linking to the associated paper [More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models](https://huggingface.co/papers/2505.21523).
This ensures better discoverability and provides essential information for users.
README.md CHANGED

@@ -1 +1,15 @@
+---
+license: cc-by-4.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+
+# More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
+
+The model was presented in the paper [More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models](https://huggingface.co/papers/2505.21523).
+
+# Paper abstract
+
+Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
+
 This repository contains the model presented in [Ocean-R1: An Open and Generalizable Large Vision-Language Model enhanced by Reinforcement Learning](https://github.com/VLM-RL/Ocean-R1).
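As context for the metadata added above, here is a minimal sketch of what the `library_name: transformers` and `pipeline_tag: image-text-to-text` fields correspond to on the consumer side. The Hub repo id and image URL below are placeholders, not specified by this PR, and the exact call may vary with the transformers version.

```python
# Minimal sketch only: assumes a recent transformers release that supports the
# "image-text-to-text" pipeline task declared in the new model-card metadata.
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",          # matches the added `pipeline_tag`
    model="<hub-org>/<model-repo-id>",  # placeholder; substitute this repository's Hub id
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sample.jpg"},  # placeholder image
            {"type": "text", "text": "Describe what is shown in this image."},
        ],
    }
]

# Chat-style inputs are passed via `text=`; the pipeline returns generated responses.
print(pipe(text=messages, max_new_tokens=128))
```

With the metadata in place, the Hub can also surface the correct inference widget and library snippet for the model page automatically.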