FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
Abstract
FinLFQA evaluates LLMs' ability to provide reliable and nuanced attributions in long-form financial question answering through human and automatic assessments.
Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to attach attributions to LLM outputs. However, existing benchmarks primarily focus on simple attribution, retrieving supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are essential for distinguishing model capabilities, that end-to-end generation achieves performance comparable to post-hoc approaches, and that iterative refinement helps only when guided by external feedback.
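To make the three attribution types concrete, the sketch below models one attributed response record plus a toy span-level F1 over extracted evidence. This is a minimal illustration only: the field names, example values, and metric are assumptions for exposition, not the benchmark's actual schema or scoring.

```python
from dataclasses import dataclass


@dataclass
class AttributedAnswer:
    """One hypothetical FinLFQA-style response with its three attribution types.

    Field names are illustrative, not the benchmark's actual schema.
    """
    answer: str                 # long-form answer text
    evidence: list[str]         # spans extracted from the financial report
    reasoning_steps: list[str]  # intermediate numerical derivations
    domain_knowledge: list[str] # financial concepts invoked in the reasoning


def evidence_f1(predicted: list[str], gold: list[str]) -> float:
    """Toy span-level F1 over extracted evidence (exact-match, illustrative only)."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy usage with made-up report snippets:
response = AttributedAnswer(
    answer="Net margin improved because revenue grew faster than costs.",
    evidence=["Revenue rose 12% to $4.2B", "Operating costs rose 5%"],
    reasoning_steps=["margin = (4.2 - 3.1) / 4.2 = 0.262"],
    domain_knowledge=["Net margin = net income / revenue"],
)
print(evidence_f1(response.evidence, ["Revenue rose 12% to $4.2B"]))  # 0.667
```

In practice, exact-match span overlap is a crude proxy; a real attribution evaluator would likely need fuzzy or entailment-based matching, but the precision/recall decomposition above conveys the shape of such a metric.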
Community
This paper introduces a financial long-form QA benchmark with an automated evaluation framework, requiring LLMs to produce clause-level attributions that include supporting evidence, executable numerical reasoning, and domain knowledge.
Similar papers recommended by the Semantic Scholar API:
- KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues (2025)
- LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text (2025)
- Decomposing and Revising What Language Models Generate (2025)
- XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering (2025)
- Evaluating Open-Source Large Language Models for Technical Telecom Question Answering (2025)
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper (2025)
- KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering (2025)