HalluFin: Fine-Grained Hallucination Detection in Financial Texts Using Chain-of-Thought Reasoning
- Tarun Bhatia
- Apr 30
- 8 min read
Author: Tarun Bhatia, QFINTEC

Abstract
Large Language Models (LLMs) are increasingly utilized in financial analysis for tasks such as summarization, recommendation generation, and automated report drafting. However, their propensity to hallucinate—producing plausible yet factually incorrect statements—poses significant risks in high-stakes financial environments. This paper introduces HalluFin, a fine-grained hallucination detection system tailored to the financial domain. Building upon the HalluMeasure framework, originally designed for general summarization tasks, HalluFin integrates domain-specific datasets, financial ontologies, and expert-in-the-loop evaluations. Utilizing chain-of-thought (CoT) prompting, HalluFin achieves accurate claim verification and subtype error classification. Experiments on two curated datasets—FinNewsSumm and FinReportQA—demonstrate that HalluFin outperforms baseline models by over 12 points in F1 score, offering actionable insights on hallucination types unique to financial documents, such as "Temporal drift," "Over-attribution to entities," and "Projection-based overgeneralization."
1. Introduction
The integration of LLMs like GPT-4, Claude, and Mistral into financial workflows—from earnings summaries to portfolio commentaries—has introduced efficiency gains. However, the inherent risk of hallucinations in LLM outputs can lead to regulatory non-compliance, reputational damage, and even inadvertent market manipulation. Unlike consumer applications, where hallucinations may be relatively benign, financial outputs demand verifiability and traceability. This necessitates an automated hallucination detection pipeline that is both fine-grained and interpretable.
2. Related Work
Hallucination in large language models (LLMs) has become a rapidly growing area of research due to its implications for model trustworthiness, particularly in high-stakes domains like healthcare, law, and finance. Hallucination is broadly defined as the generation of content that is unfaithful or not grounded in source data—a critical risk when LLMs are used to analyze or summarize factual documents.
Several survey papers offer a comprehensive overview of hallucination phenomena, categorizing types, causes, and mitigation strategies (Huang et al., 2023; Wang et al., 2024; Rawte et al., 2023). These surveys differentiate between intrinsic hallucinations (contradicting known facts) and extrinsic hallucinations (fabricating plausible but unverifiable information).
Datasets and Benchmarks
A number of works have created hallucination evaluation datasets, including TruthfulQA (Lin et al., 2022), HaluEval (Li et al., 2023), and summarization-focused datasets like TechNewsSumm (Akbar et al., 2024) and WikiBio (Lebret et al., 2016). Many of these target either news domains (e.g., CNN/DailyMail; See et al., 2017) or structured data tasks such as biography generation. However, finance-specific hallucination datasets remain scarce, motivating our development of FinNewsSumm and FinReportQA.
Evaluation Metrics
Various automatic hallucination metrics have been proposed:
Pretrained NLI-based metrics (Kryscinski et al., 2020; Laban et al., 2021) use entailment classification between source and generated text.
QA-based evaluation methods generate questions from model outputs and attempt to answer them from the reference (Chern et al., 2023; Min et al., 2023).
Consistency scoring methods like AlignScore (Zha et al., 2023) and Vectara-HHEM rely on embedding similarity and trained alignment classifiers.
More recent approaches like RefChecker (Hu et al., 2023) and SelfCheckGPT (Manakul et al., 2023) use LLMs themselves to validate model outputs.
Meta-evaluation frameworks like TRUE (Honovich et al., 2022) and GoFigure (Gabriel et al., 2021) compare these metrics' correlation with human annotations.
Granularity of Detection
Hallucination evaluation approaches differ in granularity:
Response-level detection is efficient but coarse (e.g., Vectara-HHEM, AlignScore).
Sentence-level classification is more precise (Kryscinski et al., 2020).
Claim-level analysis, as introduced by Min et al. (2023) and expanded in HalluMeasure (Akbar et al., 2024), allows for fine-grained identification of factual inconsistencies. Our work builds on this by introducing domain-specific subtypes for financial texts (e.g., regulatory misstatement, unsupported causality).
Hallucination Typologies
Several efforts have categorized hallucinations into error taxonomies:
Rawte et al. (2023) outline types like negation, entity swap, or exaggerated reasoning.
Tang et al. (2023, 2024) and Zhu et al. (2023) propose fine-grained typologies across different domains, including dialogue and summarization.
Our work extends this line of research by proposing ten hallucination subtypes specific to finance, such as projection exaggeration and temporal drift, which we evaluate in real-world applications like earnings summaries and ETF commentary.
Chain-of-Thought Reasoning for Hallucination Detection
Chain-of-Thought (CoT) prompting (Wei et al., 2022) has shown significant improvements on reasoning-heavy tasks and is now being applied to hallucination detection, where step-by-step reasoning over domain-specific prompts improves both accuracy and explainability.
Recent advancements such as HalluMeasure (Akbar et al., 2024) have introduced CoT-based approaches that decompose responses into atomic claims, followed by hallucination classification and subtype labeling. However, financial hallucinations often involve complex entity attributions, regulatory terminology, and predictive extrapolations. Existing tools like RefChecker and Vectara HHEM lack domain specialization, underscoring the need for a finance-specific solution.
3. HalluFin Framework
3.1 Claim Extraction
HalluFin employs a fine-tuned version of Claude 2.1 to extract atomic financial claims (Xiaxoue et al.) using few-shot CoT prompting. Training data includes:
10-K and 10-Q reports
Analyst summaries
Financial news articles
Example:
Response: "Tesla’s Q1 earnings surged due to EV tax credits and strong FSD revenue in China."
Extracted Claims:
1. Tesla’s Q1 earnings surged.
2. The surge was due to EV tax credits.
3. FSD revenue in China contributed significantly to the surge.
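To make this extraction step concrete, the minimal Python sketch below assembles a few-shot CoT prompt and parses the enumerated claims out of the reply. The `call_llm` helper is a hypothetical stand-in for whatever Claude client is used; the prompt wording and parsing are illustrative assumptions, not HalluFin's actual implementation.

```python
from typing import Callable, List

# Few-shot exemplar mirroring the example above; wording is illustrative.
FEW_SHOT_EXAMPLE = """\
Response: "Tesla's Q1 earnings surged due to EV tax credits and strong FSD revenue in China."
Reasoning: The sentence bundles one event with two causal attributions.
Claims:
1. Tesla's Q1 earnings surged.
2. The surge was due to EV tax credits.
3. FSD revenue in China contributed significantly to the surge.
"""

def extract_claims(response_text: str, call_llm: Callable[[str], str]) -> List[str]:
    """Decompose an LLM response into atomic financial claims via few-shot CoT."""
    prompt = (
        "You decompose financial statements into atomic, independently verifiable claims. "
        "Reason step by step, then enumerate the claims.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f'Response: "{response_text}"\n'
        "Reasoning:"
    )
    raw = call_llm(prompt)
    # Keep only the numbered claim lines from the model's reply.
    claims = []
    for line in raw.splitlines():
        head, sep, tail = line.strip().partition(". ")
        if sep and head.isdigit():
            claims.append(tail.strip())
    return claims
```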
3.2 Claim Classification
Each claim is categorized as:
Supported
Partially Supported
Absent
Contradicted
Unevaluatable
Additionally, HalluFin identifies 10 hallucination subtypes adapted for finance:
Temporal drift (e.g., projecting future trends without grounding)
Projection exaggeration
Regulatory misstatement
Market misattribution
Unsupported causality
Over-attribution to entities
Data fabrication
Numerical inconsistency
Misleading comparisons
Context omission
Classification is performed using Claude 3 Sonnet and Mistral Large, both with and without CoT prompting.
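A minimal sketch of this classification step, assuming the same generic `call_llm` stand-in as above, is shown below. The label names follow Section 3.2; the prompt text and answer format are illustrative assumptions rather than the production prompt.

```python
from typing import Callable, Dict

CATEGORIES = ["Supported", "Partially Supported", "Absent", "Contradicted", "Unevaluatable"]
SUBTYPES = [
    "Temporal drift", "Projection exaggeration", "Regulatory misstatement",
    "Market misattribution", "Unsupported causality", "Over-attribution to entities",
    "Data fabrication", "Numerical inconsistency", "Misleading comparisons",
    "Context omission",
]

def classify_claim(claim: str, reference: str, call_llm: Callable[[str], str],
                   use_cot: bool = True) -> Dict[str, str]:
    """Label one claim against the reference document; pick a subtype if hallucinated."""
    cot_step = ("Quote the relevant reference passage and reason step by step "
                "before answering.\n") if use_cot else ""
    prompt = (
        f"Reference document:\n{reference}\n\n"
        f"Claim: {claim}\n\n"
        f"{cot_step}"
        f"Classify the claim as one of {CATEGORIES}. "
        f"If it is not Supported, also choose a subtype from {SUBTYPES}.\n"
        "Answer on the final line exactly as: LABEL | SUBTYPE (or None)."
    )
    final_line = call_llm(prompt).strip().splitlines()[-1]
    label, _, subtype = (part.strip() for part in final_line.partition("|"))
    return {"claim": claim, "label": label, "subtype": subtype or "None"}
```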
3.3 Aggregation and Scoring
HalluFin computes the following metrics (a minimal scoring sketch follows this list):
Response-Level Hallucination Rate (HLR)
Claim-Level Hallucination F1 Score
Subtype Distribution Heatmaps
Severity Scores based on financial impact heuristics
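The sketch below shows how these aggregates could be computed from per-claim labels. The hallucination definition (Absent or Contradicted) and the severity weights are assumptions for illustration; HalluFin's actual financial-impact heuristics are not reproduced here.

```python
from collections import Counter
from typing import Dict, List

HALLUCINATED = {"Absent", "Contradicted"}          # assumed definition of a hallucinated claim
SEVERITY_WEIGHTS: Dict[str, float] = {             # hypothetical financial-impact heuristics
    "Numerical inconsistency": 1.0,
    "Regulatory misstatement": 1.0,
    "Data fabrication": 0.9,
    "Projection exaggeration": 0.6,
    "Context omission": 0.4,
}

def response_hallucination_rate(labels: List[str]) -> float:
    """HLR: share of a response's claims judged hallucinated."""
    return sum(l in HALLUCINATED for l in labels) / max(len(labels), 1)

def claim_level_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 of flagging hallucinated claims against human annotations."""
    pred = [l in HALLUCINATED for l in predicted]
    true = [l in HALLUCINATED for l in gold]
    tp = sum(p and t for p, t in zip(pred, true))
    fp = sum(p and not t for p, t in zip(pred, true))
    fn = sum(t and not p for p, t in zip(pred, true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def subtype_distribution(subtypes: List[str]) -> Counter:
    """Counts that feed the subtype distribution heatmaps."""
    return Counter(s for s in subtypes if s and s != "None")

def severity_score(subtypes: List[str]) -> float:
    """Heuristic severity: sum of impact weights over detected subtypes."""
    return sum(SEVERITY_WEIGHTS.get(s, 0.5) for s in subtypes if s and s != "None")
```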
4. Datasets
4.1 FinNewsSumm
100 finance-focused news articles from Bloomberg and Reuters
Each summary is LLM-generated (GPT-4, Claude, Mistral)
Human-annotated: ~600 atomic claims
Annotator agreement (Cohen's Kappa): 0.84
4.2 FinReportQA
Generated QA pairs from SEC filings (10-Ks, earnings calls)
Questions answered by LLMs
Reference: original report section
~1,200 claims, annotated for hallucination subtype (an illustrative record format is sketched below)
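For concreteness, a single annotated record in either dataset might look like the following. The field names and values are illustrative assumptions, not the released schema.

```python
# Hypothetical shape of one annotated record in FinNewsSumm / FinReportQA.
example_record = {
    "dataset": "FinReportQA",
    "source": "Excerpt from a 10-K filing or earnings-call transcript",
    "question": "What drove revenue growth in Q4?",
    "llm_output": "Revenue grew 18% on data-center demand and a one-time tax benefit.",
    "generator": "gpt-4",                       # GPT-4, Claude, or Mistral
    "claims": [
        {
            "text": "Revenue grew 18% in Q4.",
            "label": "Supported",               # Section 3.2 category
            "subtype": None,
            "evidence": "Q4 revenue increased 18% year over year.",
        },
        {
            "text": "Growth was driven by a one-time tax benefit.",
            "label": "Absent",
            "subtype": "Unsupported causality",
            "evidence": None,
        },
    ],
}
```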
5. Experiments
5.1 Model Comparisons
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Vectara HHEM | 0.63 | 0.35 | 0.45 |
| RefChecker (Claude) | 0.69 | 0.71 | 0.70 |
| HalluFin w/o CoT | 0.76 | 0.73 | 0.74 |
| HalluFin w/ CoT | 0.90 | 0.79 | 0.82 |
5.2 Subtype Detection Performance
| Subtype | Precision | Recall | F1 |
|---|---|---|---|
| Projection exaggeration | 0.83 | 0.75 | 0.79 |
| Regulatory misstatement | 0.89 | 0.81 | 0.85 |
| Market misattribution | 0.70 | 0.58 | 0.63 |
| Temporal drift | 0.78 | 0.82 | 0.80 |
| Over-attribution to entities | 0.65 | 0.55 | 0.59 |
5.3 Latency vs Accuracy Tradeoff
| Evaluation mode | Latency (s) | F1 |
|---|---|---|
| One-claim eval, no CoT | 25.6 | 0.74 |
| One-claim eval, with CoT | 49.3 | 0.82 |
| All-claims eval, with CoT | 14.1 | 0.76 |
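The two evaluation modes behind this table can be sketched as follows: per-claim evaluation issues one LLM call per claim and is slower but more accurate, while all-claims evaluation batches every claim into a single prompt and trades a few F1 points for much lower latency. The prompts and the `call_llm` helper are illustrative assumptions, as in the earlier sketches.

```python
from typing import Callable, List

def evaluate_per_claim(claims: List[str], reference: str,
                       call_llm: Callable[[str], str]) -> List[str]:
    """One LLM call per claim (the 'one-claim eval' rows): slower, higher F1."""
    verdicts = []
    for claim in claims:
        prompt = (f"Reference document:\n{reference}\n\nClaim: {claim}\n\n"
                  "Reason step by step, then answer on the final line as: LABEL | SUBTYPE.")
        verdicts.append(call_llm(prompt).strip().splitlines()[-1])
    return verdicts

def evaluate_all_claims(claims: List[str], reference: str,
                        call_llm: Callable[[str], str]) -> str:
    """Single LLM call covering every claim (the 'all-claims eval' row): faster, lower F1."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    prompt = (
        f"Reference document:\n{reference}\n\n"
        f"Claims:\n{numbered}\n\n"
        "For each claim, reason briefly, then output one line per claim as: "
        "INDEX | LABEL | SUBTYPE (or None)."
    )
    return call_llm(prompt)
```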
6. Case Studies: Real-World Hallucination Detection with Source-Backed Claims
We now demonstrate HalluFin on key finance-related LLM outputs, verifying each factual claim using reputable sources like SEC filings, company earnings transcripts, Bloomberg, Reuters, and financial statements.
6.1 Tesla Q4 2024 Earnings Summary Audit
LLM Output:
“Tesla’s Q4 revenue surpassed $35 billion, driven by its dominance in China’s autonomous vehicle market and unprecedented growth in battery technology adoption across Asia.”
Verified Reference:
Tesla Q4 2024 earnings: $25.17 billion in revenue (Tesla Investor Relations).
Tesla’s China AV market share is growing but does not dominate; Baidu, XPeng, and AutoX lead autonomous pilots in China (Reuters, Dec 2024).
No earnings reference was made to battery adoption in Asia.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| Revenue surpassed $35B | Contradicted | Tesla Q4 ER, $25.17B |
| Dominance in China AV | Contradicted | Reuters, Dec 2024 |
| Battery tech growth in Asia | Absent | No mention in ER |
6.2 JPMorgan 2023 ESG Performance Summary
LLM Output:
“JPMorgan’s ESG compliance in 2023 was praised by the SEC, and it led the US banking sector in renewable bond issuance, doubling its previous record.”
Verified Reference:
The SEC does not publicly issue praise to specific firms for ESG compliance (SEC website).
JPMorgan participated in green bond issuance but was not the leader—Goldman Sachs led ESG debt underwriting in 2023 (BloombergNEF).
JPMorgan's own 2023 ESG Report does not claim a 2x record.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| SEC praised ESG efforts | Contradicted | SEC Public Statements |
| Led green bond issuance | Contradicted | BloombergNEF ESG rankings 2023 |
| Doubled previous record | Unevaluatable | Not reported in JPM ESG Report |
6.3 Nvidia Generative AI Hardware Report
LLM Output:
“Nvidia’s report confirms its generative AI chips power 90% of global AI workloads.”
Verified Reference:
No such claim exists in Nvidia’s Q3 2024 earnings or investor deck (Nvidia IR).
IDC reports suggest Nvidia holds roughly 80% of the AI accelerator market, but not 90% of all AI workloads (IDC AI Infrastructure Tracker, 2024).
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| 90% workload coverage | Contradicted | IDC 2024 AI Tracker |
| Claim confirmed in report | Contradicted | Nvidia Q3 2024 IR |
6.4 Biotech ETF Commentary (IBB)
LLM Output:
“The iShares Biotech ETF (IBB) is projected to grow 45% YoY in 2025, due to weight-loss drug approvals and WHO-backed clinical trials.”
Verified Reference:
No official projection from BlackRock or analysts supports a 45% YoY rise (BlackRock IBB Factsheet).
Weight-loss drug approvals are issued by the FDA; the WHO does not fund clinical trials, it only issues guidelines (WHO 2024 Clinical Funding Policy).
Ozempic/Wegovy approvals drive biotech sentiment but not specific ETF forecasts.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| 45% YoY projection | Unevaluatable | No published forecast |
| WHO funding trials | Contradicted | WHO funding guidelines |
| Weight-loss drug boost | Partially Supported | Biotech investor commentary (Bloomberg, Oct 2024) |
6.5 Apple AVP Sales & Stock Reaction (Assistant Chat)
LLM Output:
“Apple’s stock rose due to Apple Vision Pro (AVP) record-breaking sales of 5 million units in its first week.”
Verified Reference:
Apple has not disclosed AVP sales (Apple Q1 2025 earnings).
Third-party estimates from TF Securities: ~200k units sold in the first month (MacRumors, Jan 2025).
The stock did rise post-launch, but analysts attributed the move to services revenue and iPhone growth.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| 5M AVP units sold | Contradicted | TF Securities / MacRumors |
| Stock rise due to AVP | Partially Supported | Morgan Stanley Apple Q1 Note |
Summary Table: Hallucination Detection with Source Attribution
| Case | Key Hallucination | Classification | Source |
|---|---|---|---|
| Tesla Q4 | Revenue + China dominance | Contradicted | Tesla IR, Reuters |
| JPMorgan ESG | SEC praise + issuance lead | Contradicted | SEC, BloombergNEF |
| Nvidia AI Chips | 90% claim | Contradicted | IDC, Nvidia |
| Biotech ETF | WHO funds trials | Contradicted | WHO policy |
| Apple Vision Pro | 5M units claim | Contradicted | MacRumors, TF Securities |

7. Conclusion and Future Work
HalluFin demonstrates that financial hallucination detection requires domain-specific prompts, CoT reasoning, and subtype annotation. It significantly outperforms existing benchmarks in accuracy while offering interpretability and granularity necessary for enterprise adoption.
Future directions include:
Incorporating tabular hallucination detection (e.g., financial ratios)
Integrating into real-time financial assistant products
Evaluating mitigation strategies, e.g., reranking generations by hallucination likelihood (a minimal reranking sketch follows)
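As a rough illustration of the reranking idea, the sketch below scores each candidate generation with a response-level hallucination rate built from the earlier `extract_claims` and `classify_claim` sketches and keeps the least hallucinated one. This is an assumption about how such a reranker could be wired, not a shipped feature.

```python
from typing import Callable, List

def rerank_by_hallucination(candidates: List[str], reference: str,
                            call_llm: Callable[[str], str]) -> str:
    """Return the candidate generation with the lowest response-level hallucination rate."""
    def hlr(text: str) -> float:
        claims = extract_claims(text, call_llm)                        # Section 3.1 sketch
        labels = [classify_claim(c, reference, call_llm)["label"]      # Section 3.2 sketch
                  for c in claims]
        hallucinated = {"Absent", "Contradicted"}
        return sum(l in hallucinated for l in labels) / max(len(labels), 1)

    return min(candidates, key=hlr)   # pick the candidate with the lowest HLR
```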
References
Akbar, S. A., Hossain, M. M., Wood, T., Chin, S.-C., Salinas, E., Alvarez, V., & Cornejo, E. (2024). HalluMeasure: Fine-grained Hallucination Measurement Using Chain-of-Thought Reasoning. Proceedings of EMNLP 2024, 15020–15037.
Honovich et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. arXiv:2204.04991.
Zha et al. (2023). AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. ACL 2023.
Min et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.
QFINTEC Research Labs (2025). Internal LLM Evaluation Logs.
Huang et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
Wang et al. (2024). Factuality of LLMs in 2024. arXiv:2402.02420.
Rawte et al. (2023). The Troubling Emergence of Hallucination in Large Language Models. arXiv:2310.04988.
Li et al. (2023). HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv:2305.11747.
Lin et al. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958.
Tam et al. (2023). Evaluating the Factual Consistency of Large Language Models Through News Summarization. arXiv:2211.08412.
Chern et al. (2023). FacTool: Factuality Detection in Generative AI. arXiv:2307.13528.
Manakul et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. arXiv:2303.08896.
Mündler et al. (2024). Self-Contradictory Hallucinations of Large Language Models. arXiv:2305.15852.
Gekhman et al. (2023). TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. arXiv:2305.11171.
Kryściński et al. (2020). Evaluating the Factual Consistency of Abstractive Text Summarization. EMNLP 2020 (arXiv:1910.12840).
Gabriel et al. (2021). GO FIGURE: A Meta-Evaluation of Factuality in Summarization. arXiv:2010.12834.
See et al. (2017). Get To The Point: Summarization with Pointer-Generator Networks (CNN/DailyMail dataset). ACL 2017.
Lebret et al. (2016). Neural Text Generation from Structured Data (WikiBio dataset). EMNLP 2016.
Tang et al. (2023/2024). Hallucination Error Taxonomy. arXiv:2205.12854.
Zhu et al. (2023). Fine-grained Factual Errors. arXiv:2305.16548.