admarcosai's Collections
Benchmarks
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper • 2311.12022 • Published • 25
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 183
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Paper • 2312.04724 • Published • 20
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Paper • 2401.04695 • Published • 11
reasoning-machines/gsm-hard
Viewer • Updated • 1.32k • 445 • 36
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Paper • 2402.01622 • Published • 33
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
Paper • 2401.17167 • Published • 1
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning
Paper • 2312.05230 • Published
LongAlign: A Recipe for Long Context Alignment of Large Language Models
Paper • 2401.18058 • Published • 21
Premise Order Matters in Reasoning with Large Language Models
Paper • 2402.08939 • Published • 25
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
Paper • 2402.10790 • Published • 40