FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Abstract
FinAuditing is a benchmark for evaluating LLMs on structured financial auditing tasks, revealing their limitations in handling taxonomy-driven, hierarchical financial documents.
The complexity of Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks: FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency. Each targets a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework that integrates retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available on Hugging Face.
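To make "numerical consistency" concrete, the sketch below checks whether a reported parent value matches the weighted sum of its children, in the spirit of an XBRL calculation relationship. It is an illustration only, not the benchmark's actual task format or evaluation code: the `CalcArc` type, the `find_calc_inconsistencies` function, the tolerance, and the sample figures are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalcArc:
    """Hypothetical stand-in for one calculation relationship."""
    parent: str
    child: str
    weight: float  # typically +1.0 or -1.0


def find_calc_inconsistencies(facts, arcs, tolerance=0.5):
    """Return (parent, reported, computed) for every parent whose reported
    value differs from the weighted sum of its children by more than
    `tolerance`. Parents with any missing fact are skipped."""
    arcs_by_parent = {}
    for arc in arcs:
        arcs_by_parent.setdefault(arc.parent, []).append(arc)

    issues = []
    for parent, parent_arcs in arcs_by_parent.items():
        if parent not in facts:
            continue
        if any(arc.child not in facts for arc in parent_arcs):
            continue
        computed = sum(arc.weight * facts[arc.child] for arc in parent_arcs)
        if abs(facts[parent] - computed) > tolerance:
            issues.append((parent, facts[parent], computed))
    return issues


# Made-up figures for illustration; not taken from any real filing.
facts = {
    "Assets": 1000.0,
    "AssetsCurrent": 400.0,
    "AssetsNoncurrent": 500.0,
    "GrossProfit": 300.0,
    "Revenues": 800.0,
    "CostOfRevenue": 500.0,
}
arcs = [
    CalcArc("Assets", "AssetsCurrent", 1.0),
    CalcArc("Assets", "AssetsNoncurrent", 1.0),
    CalcArc("GrossProfit", "Revenues", 1.0),
    CalcArc("GrossProfit", "CostOfRevenue", -1.0),
]

issues = find_calc_inconsistencies(facts, arcs)
print(issues)
```

In this toy input, `Assets` is reported as 1000 while its children sum to 900, so it is flagged; `GrossProfit` equals `Revenues` minus `CostOfRevenue` and passes. A rule-based check like this needs the relationships handed to it explicitly, whereas the benchmark asks whether an LLM can reason about such dependencies across the documents of a filing.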
Community
FinAuditing is a taxonomy-aligned and structure-aware benchmark designed to evaluate large language models (LLMs) on financial auditing reasoning. It is built from real US-GAAP-compliant XBRL filings and defines three complementary subtasks: FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency. These tasks jointly assess models’ ability to reason over hierarchical and interdependent financial documents. Experiments on 13 state-of-the-art LLMs reveal substantial performance gaps, highlighting the challenges of structured and taxonomy-grounded financial reasoning and underscoring the need for more trustworthy and regulation-aligned financial intelligence systems.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FinReflectKG -- MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence (2025)
- FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs (2025)
- FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering (2025)
- Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts (2025)
- ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering (2025)
- FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning (2025)
- EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving (2025)