zzfive's Collections
benchmark
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Paper • 2411.18499 • Published • 18
VLSBench: Unveiling Visual Leakage in Multimodal Safety
Paper • 2411.19939 • Published • 9
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Paper • 2412.02611 • Published • 23
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
Paper • 2412.03205 • Published • 16
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Paper • 2412.06559 • Published • 77
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Paper • 2412.07626 • Published • 22
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
Paper • 2412.08687 • Published • 13
SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Paper • 2412.10319 • Published • 9
Are Your LLMs Capable of Stable Reasoning?
Paper • 2412.13147 • Published • 91
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
Paper • 2412.12606 • Published • 41
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Paper • 2412.14161 • Published • 50
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
Paper • 2412.13746 • Published • 9
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Paper • 2501.01257 • Published • 47
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Paper • 2501.02955 • Published • 40
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Paper • 2501.05510 • Published • 35
WebWalker: Benchmarking LLMs in Web Traversal
Paper • 2501.07572 • Published • 18
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Paper • 2501.08292 • Published • 16
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper • 2501.08828 • Published • 25