arxiv:2507.04562

Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Published on Jul 6
· Submitted by jannalu on Jul 8

Abstract

State-of-the-art large language models are evaluated on real-world forecasting questions and show lower accuracy than human superforecasters.

AI-generated summary

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against human superforecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of superforecasters.
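
The comparison rests on the Brier score: for a binary question with outcome o ∈ {0, 1} and forecast probability p, the score is (p − o)², averaged over all questions; lower is better, and a constant 0.5 forecast scores 0.25. A minimal Python sketch of the metric (illustrative only, with hypothetical probabilities; not the paper's evaluation code):

    # Brier score: mean squared error between forecast probabilities and
    # binary outcomes (1 = event occurred, 0 = it did not). Lower is better.
    def brier_score(forecasts, outcomes):
        return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

    # Hypothetical example with three resolved questions.
    llm_probs = [0.70, 0.20, 0.90]
    superforecaster_probs = [0.80, 0.10, 0.95]
    outcomes = [1, 0, 1]

    print(brier_score(llm_probs, outcomes))             # ~0.047
    print(brier_score(superforecaster_probs, outcomes)) # ~0.018

Here the superforecasters' sharper probabilities yield the lower (better) score, which is the pattern the paper reports at scale.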

Community

Paper submitter · edited 1 day ago

we do things because we thought they were easy :')
