Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
Abstract
Humanity's Last Code Exam (HLCE) is a benchmark for advanced code generation built from challenging programming-contest problems, revealing substantial room for improvement in current LLMs.
Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPS and LiveCodeBench) consist largely of medium-difficulty problems that pose little challenge to advanced LLMs. To better measure advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising 235 of the most challenging problems from the International Collegiate Programming Contest (ICPC) World Finals and the International Olympiad in Informatics (IOI) spanning 2010 to 2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. In our comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini 2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. We also propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities; the results indicate that self-recognition ability is not proportionally correlated with code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs still have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are publicly available at https://github.com/Humanity-s-Last-Code-Exam/HLCE.
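For context on the headline numbers, pass@k in code-generation benchmarks is conventionally computed with the unbiased estimator of Chen et al. (2021): pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems, where n solutions are sampled per problem and c of them pass all tests. The abstract does not spell out HLCE's exact protocol, so the Python sketch below is illustrative of the standard estimator, not the authors' implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    pass@k = 1 - C(n - c, k) / C(n, k),
    where n = samples drawn per problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a size-k draw: at least one pass
    # Product form of the binomial ratio; avoids overflow for large n.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical example: 4 correct solutions out of 100 samples.
print(f"pass@1  = {pass_at_k(100, 4, 1):.3f}")   # 0.040
print(f"pass@10 = {pass_at_k(100, 4, 10):.3f}")  # ~0.348
```

Reporting pass@k for several k (as in the paper's test-time scaling analysis) shows how much headroom repeated sampling buys over a single attempt.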
Community
We are honored to present Humanity's Last Code Exam, a benchmark composed of competition problems from the ICPC World Finals and IOI spanning 2010 to 2024. Even the strongest model, o4-mini (high), achieves only a 15.9% pass@1 rate. Our findings further reveal that the reasoning capabilities of current models are far from their theoretical limits, and that the test-time scaling law holds potential for continued improvement. Project repository: https://github.com/Humanity-s-Last-Code-Exam/HLCE