MixEval

https://mixeval.github.io/

AI & ML interests

LLM & LMM evaluation

authored 4 papers 8 months ago

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Paper • 2510.24702 • Published Oct 28, 2025 • 32

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Paper • 2510.25726 • Published Oct 29, 2025 • 47

Simulating Environments with Reasoning Models for Agent Training

Paper • 2511.01824 • Published Nov 3, 2025 • 2

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Paper • 2512.07783 • Published Dec 8, 2025 • 41

authored 7 papers 9 months ago

Diffusion Language Models are Super Data Learners

Paper • 2511.03276 • Published Nov 5, 2025 • 132

Training Optimal Large Diffusion Language Models

Paper • 2510.03280 • Published Sep 28, 2025 • 1

Logical Reasoning over Natural Language as Knowledge Representation: A Survey

Paper • 2303.12023 • Published Mar 21, 2023 • 2

Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey

Paper • 2105.04387 • Published May 10, 2021

Long-Context Inference with Retrieval-Augmented Speculative Decoding

Paper • 2502.20330 • Published Feb 27, 2025

SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Paper • 2506.02096 • Published Jun 2, 2025 • 53

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Paper • 2509.24002 • Published Sep 28, 2025 • 180

authored 4 papers 10 months ago

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Paper • 2411.15296 • Published Nov 22, 2024 • 21

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Paper • 2501.13826 • Published Jan 23, 2025 • 24

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Paper • 2509.23661 • Published Sep 28, 2025 • 51

Visual Jigsaw Post-Training Improves MLLMs

Paper • 2509.25190 • Published Sep 29, 2025 • 37

authored 5 papers about 1 year ago

Small Models Struggle to Learn from Strong Reasoners

Paper • 2502.12143 • Published Feb 17, 2025 • 40

Evaluating Vision-Language Models as Evaluators in Path Planning

Paper • 2411.18711 • Published Nov 27, 2024

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Paper • 2503.10582 • Published Mar 13, 2025 • 25

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Paper • 2503.19877 • Published Mar 25, 2025 • 2

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Paper • 2504.10342 • Published Apr 14, 2025 • 11