Confidence and Stability of Global and Pairwise Scores in NLP Evaluation Paper • 2507.01633 • Published 15 days ago
IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons Paper • 2110.14990 • Published Oct 28, 2021
Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings Paper • 2410.12046 • Published Oct 15, 2024
Long Code Arena: a Set of Benchmarks for Long-Context Code Models Paper • 2406.11612 • Published Jun 17, 2024
PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code Paper • 2103.12778 • Published Mar 23, 2021
On The Importance of Reasoning for Context Retrieval in Repository-Level Code Editing Paper • 2406.04464 • Published Jun 6, 2024