ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Abstract
ArtifactsBench, a novel benchmark and evaluation framework, automates the assessment of visual code generation quality using temporal screenshots and a multimodal language model judge.
The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
Community
Tencent Hunyuan Releases ArtifactsBench: A Next-Generation “What-You-See-Is-What-You-Get” Evaluation Standard for Code Generation
ArtifactsBench is designed to comprehensively measure large language models (LLMs) on their ability to generate visually rich, interactive, and dynamic code artifacts. As AI code generation enters a new phase, ArtifactsBench provides the industry with a precise yardstick for evaluating and advancing models from “able to write code” to “able to write high-quality, user-friendly code.”
Facing the Challenge: Built for Visual and Interactive Code
Traditional programming benchmarks focus mainly on algorithmic correctness and overlook the crucial aspects of visual presentation and user experience in modern applications. ArtifactsBench is specifically created to fill this gap. It consists of 1,825 meticulously crafted tasks of unprecedented breadth and depth, covering nine real-world scenarios—from static web components and SVG data visualizations to mini-games and management systems with complex interaction logic. All tasks are stratified by difficulty, enabling systematic assessment of a model’s visual code-generation capabilities across varying complexity levels.
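For concreteness, a single task in this kind of benchmark can be thought of as a record pairing a prompt with its scenario, difficulty level, and grading checklist. The sketch below is a hypothetical Python representation; the field names and example values are illustrative assumptions, not the released ArtifactsBench schema.

```python
from dataclasses import dataclass

@dataclass
class ArtifactTask:
    """Hypothetical shape of one benchmark task (illustrative, not the official schema)."""
    task_id: str
    scenario: str          # one of the nine scenario categories, e.g. "mini_game"
    difficulty: str        # stratified difficulty level, e.g. "easy" / "medium" / "hard"
    prompt: str            # the natural-language instruction given to the model
    checklist: list[str]   # fine-grained, task-specific criteria used by the MLLM judge

# Example with made-up content:
example = ArtifactTask(
    task_id="demo-001",
    scenario="mini_game",
    difficulty="medium",
    prompt="Build a single-file HTML snake game with a score display and a restart button.",
    checklist=[
        "The game board renders and the snake moves on arrow-key input.",
        "The score updates when food is eaten.",
        "The restart button resets the game state.",
    ],
)
```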
Core Innovation: A Fully Automated, Multimodal Evaluation Pipeline
The standout feature of ArtifactsBench is its novel multimodal, automated evaluation paradigm. The pipeline first interacts with the model-generated visual artifacts (e.g., web pages, applications) through automated scripts while recording screenshots and GIFs of their dynamic behavior. These visual materials, together with the task requirements, are then submitted to a Multimodal Large Language Model acting as judge (MLLM-as-Judge) for evaluation. Guided by fine-grained, task-specific checklists, the judge delivers comprehensive, objective, and reproducible scores.
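A minimal sketch of this capture-then-judge loop is shown below. It assumes a locally saved HTML artifact and uses Playwright for rendering and scripted interaction; `judge_with_mllm` is a placeholder for whichever multimodal judge you connect, and none of this is the released ArtifactsBench harness.

```python
# Minimal sketch of the capture-then-judge idea; assumptions noted inline.
from pathlib import Path
from playwright.sync_api import sync_playwright


def capture_temporal_screenshots(artifact_html: Path, out_dir: Path, n_frames: int = 3):
    """Render the generated artifact, run a simple scripted interaction,
    and save screenshots at a few points in time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    frames = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(artifact_html.resolve().as_uri())
        for i in range(n_frames):
            page.wait_for_timeout(1000)  # let animations / state changes play out
            shot = out_dir / f"frame_{i}.png"
            page.screenshot(path=str(shot), full_page=True)
            frames.append(shot)
            # toy scripted interaction between captures: a single click on the page
            page.mouse.click(640, 400)
        browser.close()
    return frames


def judge_with_mllm(frames, source_code: str, checklist: list[str]) -> float:
    """Placeholder: send the screenshots, source code, and per-task checklist
    to a multimodal LLM judge and parse a numeric score from its reply."""
    raise NotImplementedError("wire this to your MLLM provider of choice")
```

The checklist-guided judging step is where the fine-grained, per-task criteria enter; it is left unimplemented here rather than inventing an API for any specific MLLM provider.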
Value Validation: Highly Consistent with Human Experts
The authority of any benchmark hinges on the credibility of its conclusions. Therefore, we conducted a large-scale alignment study comparing ArtifactsBench’s automated evaluation results with the fully human-voted WebDev Arena. The findings reveal that ArtifactsBench’s model rankings achieve an impressive 94.4% consistency with human expert preferences. This remarkable figure demonstrates that ArtifactsBench’s automated evaluation workflow can reliably replace traditional manual assessments and become the gold standard for measuring the visual and interactive quality of code artifacts.
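As a rough illustration of what a ranking-consistency number like this can mean, the sketch below computes the fraction of model pairs that two leaderboards order the same way. The exact consistency metric used by ArtifactsBench may differ, and the rankings in the example are made up.

```python
from itertools import combinations


def pairwise_ranking_consistency(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Fraction of model pairs ordered the same way by both rankings
    (a lower rank number means a better placement)."""
    models = sorted(rank_a.keys() & rank_b.keys())
    agree, total = 0, 0
    for m1, m2 in combinations(models, 2):
        total += 1
        if (rank_a[m1] < rank_a[m2]) == (rank_b[m1] < rank_b[m2]):
            agree += 1
    return agree / total if total else 0.0


# Made-up example: two leaderboards that agree on 2 of the 3 model pairs.
print(pairwise_ranking_consistency(
    {"model_x": 1, "model_y": 2, "model_z": 3},
    {"model_x": 1, "model_y": 3, "model_z": 2},
))  # 0.666...
```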
- 🌐 Project website: https://artifactsbenchmark.github.io/
- 📄 Paper: https://arxiv.org/abs/2507.04952
- 💻 Code: https://github.com/Tencent-Hunyuan/ArtifactsBenchmark
- 📬 Contact: [email protected]