arXiv:2311.13951

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Published on Nov 23, 2023

Abstract

AI-generated summary: A new evaluation paradigm for multimodal large language models (MLLMs) uses potent MLLMs as judges with per-sample criteria, validated by a benchmark that shows high agreement with human evaluations.

Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are largely limited to evaluating queries without regard for user experience, and they inadequately address the nuances of creative and associative multimodal tasks. The open-ended and subjective nature of such tasks poses a significant challenge to evaluation, since it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs: evaluating MLLMs against per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 21 popular MLLMs in a pairwise-comparison fashion and observe diverse performance across models. The validity of our benchmark is further supported by an 88.02% agreement with human evaluation. We contend that the proposed paradigm demonstrates the potential of MLLMs as effective evaluation tools when guided by per-sample criteria. See the online leaderboard at https://mllm-bench.llmzoo.com.
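To make the judging paradigm concrete, here is a minimal sketch of a per-sample-criteria, pairwise MLLM-as-judge comparison. It assumes an OpenAI-style chat-completions client with a GPT-4o-class judge model; the prompt wording, the `judge_pair` and `encode_image` helpers, and the A/B/Tie verdict scale are illustrative assumptions, not the authors' implementation or the MLLM-Bench prompt.

```python
# Minimal sketch of per-sample-criteria, pairwise MLLM-as-judge scoring.
# Assumes an OpenAI-style client and a GPT-4o-class judge; prompt wording,
# helper names, and the verdict format are illustrative, not the paper's.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging two model responses to the same image-grounded query.
Per-sample evaluation criteria for this query:
{criteria}

Query: {query}

Response A:
{answer_a}

Response B:
{answer_b}

Considering only the criteria above, answer with exactly one of: A, B, or Tie."""


def encode_image(path: str) -> str:
    """Return a data URL for a local image so it can be sent inline."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


def judge_pair(image_path: str, query: str, criteria: str,
               answer_a: str, answer_b: str, judge_model: str = "gpt-4o") -> str:
    """Ask the judge MLLM to pick the better response under per-sample criteria."""
    prompt = JUDGE_PROMPT.format(criteria=criteria, query=query,
                                 answer_a=answer_a, answer_b=answer_b)
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
        temperature=0,  # deterministic verdicts keep pairwise tallies reproducible
    )
    return resp.choices[0].message.content.strip()  # expected: "A", "B", or "Tie"
```

In a setup like this, tallying verdicts over all model pairs yields a leaderboard ordering, and running each pair twice with the A/B positions swapped is a common way to reduce position bias in LLM-as-judge comparisons.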
