
HalluFin: Fine-Grained Hallucination Detection in Financial Texts Using Chain-of-Thought Reasoning


Author: Tarun Bhatia, QFINTEC




Abstract

Large Language Models (LLMs) are increasingly used in financial analysis for tasks such as summarization, recommendation generation, and automated report drafting. However, their propensity to hallucinate—producing plausible yet factually incorrect statements—poses significant risks in high-stakes financial environments. This paper introduces HalluFin, a fine-grained hallucination detection system tailored to the financial domain. Building on the HalluMeasure framework, originally designed for general summarization tasks, HalluFin integrates domain-specific datasets, financial ontologies, and expert-in-the-loop evaluations. Using chain-of-thought (CoT) prompting, HalluFin performs accurate claim verification and subtype error classification. Experiments on two curated datasets—FinNewsSumm and FinReportQA—show that HalluFin outperforms baseline detectors by at least 12 points in F1 while offering actionable insight into hallucination types specific to financial documents, such as temporal drift, over-attribution to entities, and projection exaggeration.



1. Introduction

The integration of LLMs such as GPT-4, Claude, and Mistral into financial workflows—from earnings summaries to portfolio commentaries—has delivered clear efficiency gains. However, hallucinations in LLM outputs can lead to regulatory non-compliance, reputational damage, and even inadvertent market manipulation. Unlike consumer applications, where hallucinations may be benign, financial outputs demand verifiability and traceability. This motivates an automated hallucination detection pipeline that is both fine-grained and interpretable.



2. Related Work

Hallucination in large language models (LLMs) has become a rapidly growing area of research due to its implications for model trustworthiness, particularly in high-stakes domains like healthcare, law, and finance. Hallucination is broadly defined as the generation of content that is unfaithful or not grounded in source data—a critical risk when LLMs are used to analyze or summarize factual documents.

Several survey papers offer a comprehensive overview of hallucination phenomena, categorizing types, causes, and mitigation strategies (Huang et al., 2023; Wang et al., 2024; Rawte et al., 2023). These surveys differentiate between intrinsic hallucinations (contradicting known facts) and extrinsic hallucinations (fabricating plausible but unverifiable information).

Datasets and Benchmarks

A number of works have created hallucination evaluation datasets, including TruthfulQA (Lin et al., 2022), HaluEval (Li et al., 2023), and summarization-focused datasets such as TechNewsSumm (Akbar et al., 2024) and WikiBio (Lebret et al., 2016). Many of these target either news domains (e.g., CNN/DailyMail; See et al., 2017) or structured-data tasks such as biography generation. Finance-specific hallucination datasets remain scarce, motivating our development of FinNewsSumm and FinReportQA.

Evaluation Metrics

Various automatic hallucination metrics have been proposed:

  • Pretrained NLI-based metrics (Kryscinski et al., 2020; Laban et al., 2021) use entailment classification between source and generated text (a minimal scoring sketch appears at the end of this subsection).

  • QA-based evaluation methods generate questions from model outputs and attempt to answer them from the reference (Chern et al., 2023; Min et al., 2023).

  • Consistency scoring methods like AlignScore (Zha et al., 2023) and Vectara HHEM rely on embedding similarity and trained alignment classifiers.

  • More recent approaches such as RefChecker (Hu et al., 2023) and SelfCheckGPT (Manakul et al., 2023) use LLMs themselves as verifiers, checking extracted claims against references or against the model's own resampled outputs.

Meta-evaluation frameworks like TRUE (Honovich et al., 2022) and GO FIGURE (Gabriel et al., 2021) compare these metrics' correlation with human annotations.
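
To make the NLI-based approach above concrete, here is a minimal scoring sketch (illustrative only, not part of HalluFin), assuming the publicly available roberta-large-mnli checkpoint and its usual contradiction/neutral/entailment label ordering:

```python
# Minimal NLI-based entailment scorer (illustrative; assumes roberta-large-mnli).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def entailment_score(source: str, claim: str) -> float:
    """Return P(entailment) of `claim` given `source` under the MNLI classifier."""
    inputs = tokenizer(source, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Label order for roberta-large-mnli: 0=contradiction, 1=neutral, 2=entailment.
    return probs[2].item()


source = "Tesla reported Q4 2024 revenue of $25.17 billion."
claim = "Tesla's Q4 revenue surpassed $35 billion."
print(f"entailment probability: {entailment_score(source, claim):.3f}")
```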

Granularity of Detection

Hallucination evaluation approaches differ in granularity:

  • Response-level detection is efficient but coarse (e.g., Vectara HHEM, AlignScore).

  • Sentence-level classification is more precise (Kryscinski et al., 2020).

  • Claim-level analysis, as introduced by Min et al. (2023) and expanded in HalluMeasure (Akbar et al., 2024), allows for fine-grained identification of factual inconsistencies. Our work builds on this by introducing domain-specific subtypes for financial texts (e.g., regulatory misstatement, unsupported causality).

Hallucination Typologies

Several efforts have categorized hallucinations into error taxonomies:

  • Rawte et al. (2023) outline types like negation, entity swap, or exaggerated reasoning.

  • Tang et al. (2023, 2024) and Zhu et al. (2023) propose fine-grained typologies across different domains, including dialogue and summarization.

Our work extends this line of research by proposing ten hallucination subtypes specific to finance, such as projection exaggeration and temporal drift, which we evaluate in real-world applications like earnings summaries and ETF commentary.

Chain-of-Thought Reasoning for Hallucination Detection

Chain-of-Thought (CoT) prompting (Wei et al., 2022) has shown significant improvements in reasoning-heavy tasks and is now being applied to hallucination detection. HalluFin pairs CoT reasoning with finance-specific domain prompts, improving both accuracy and explainability.

Recent advancements, such as HalluMeasure (Akbar et al., 2024), have introduced CoT-based approaches for decomposing responses into atomic claims, followed by hallucination classification and subtype labeling. However, financial hallucinations often involve complex entity attributions, regulatory terminology, and predictive extrapolations. Existing tools like RefChecker and Vectara HHEM lack this domain specialization, underscoring the need for a finance-specific solution.




3. HalluFin Framework


3.1 Claim Extraction

HalluFin employs a fine-tuned version of Claude 2.1 to extract atomic financial claims (Xiaxoue et al.) using few-shot CoT prompting. Training data includes:

  • 10-K and 10-Q reports

  • Analyst summaries

  • Financial news articles

Example:

Response: "Tesla’s Q1 earnings surged due to EV tax credits and strong FSD revenue in China." Extracted Claims: Tesla’s Q1 earnings surged. The surge was due to EV tax credits. FSD revenue in China contributed significantly to the surge.​

3.2 Claim Classification


Each claim is categorized as:

  • Supported

  • Partially Supported

  • Absent

  • Contradicted

  • Unevaluatable

Additionally, HalluFin identifies 10 hallucination subtypes adapted for finance:

  • Temporal drift (e.g., projecting future trends without grounding)

  • Projection exaggeration

  • Regulatory misstatement

  • Market misattribution

  • Unsupported causality

  • Over-attribution to entities

  • Data fabrication

  • Numerical inconsistency

  • Misleading comparisons

  • Context omission


Classification is performed using Claude 3 Sonnet and Mistral Large, both with and without CoT prompting.
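
A minimal sketch of how such a CoT classification call might look is given below. The label set mirrors the taxonomy above, while the prompt wording and the call_llm helper (wrapping Claude 3 Sonnet or Mistral Large) are assumptions for illustration rather than HalluFin's exact implementation.

```python
# Illustrative claim classification with CoT. `call_llm(prompt) -> str` is a
# hypothetical helper; the prompt wording is an assumption, not HalluFin's own.
LABELS = ["Supported", "Partially Supported", "Absent", "Contradicted", "Unevaluatable"]
SUBTYPES = [
    "Temporal drift", "Projection exaggeration", "Regulatory misstatement",
    "Market misattribution", "Unsupported causality", "Over-attribution to entities",
    "Data fabrication", "Numerical inconsistency", "Misleading comparisons",
    "Context omission",
]


def classify_claim(claim: str, reference: str, call_llm) -> dict:
    """Classify one claim against the reference and, if hallucinated, pick a subtype."""
    prompt = (
        "You verify financial claims against a reference document.\n\n"
        f"Reference:\n{reference}\n\n"
        f"Claim: {claim}\n\n"
        "Reason step by step about whether the reference supports the claim, "
        "then finish with exactly two lines:\n"
        f"Label: one of {LABELS}\n"
        f"Subtype: one of {SUBTYPES}, or None if the claim is Supported"
    )
    output = call_llm(prompt)
    verdict = {"label": None, "subtype": None, "rationale": output}
    for line in output.splitlines():
        if line.startswith("Label:"):
            verdict["label"] = line.split(":", 1)[1].strip()
        elif line.startswith("Subtype:"):
            verdict["subtype"] = line.split(":", 1)[1].strip()
    return verdict
```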


3.3 Aggregation and Scoring

HalluFin computes the following (a scoring sketch follows this list):

  • Response-Level Hallucination Rate (HLR)

  • Claim-Level Hallucination F1 Score

  • Subtype Distribution Heatmaps

  • Severity Scores based on financial impact heuristics
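
As a minimal sketch of the first two aggregates, assuming each claim carries a gold and a predicted binary hallucination flag (severity heuristics and subtype heatmaps omitted):

```python
# Aggregation sketch: response-level hallucination rate and claim-level F1.
# Assumes each claim dict has boolean "gold" and "pred" hallucination flags.
def hallucination_rate(claims: list[dict]) -> float:
    """Fraction of a response's claims predicted as hallucinated (HLR)."""
    return sum(c["pred"] for c in claims) / max(len(claims), 1)


def claim_level_f1(claims: list[dict]) -> float:
    """F1 of predicted hallucination flags against gold annotations."""
    tp = sum(c["pred"] and c["gold"] for c in claims)
    fp = sum(c["pred"] and not c["gold"] for c in claims)
    fn = sum(not c["pred"] and c["gold"] for c in claims)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```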

4. Datasets

4.1 FinNewsSumm

  • 100 finance-focused news articles from Bloomberg and Reuters

  • Each summary is LLM-generated (GPT-4, Claude, Mistral)

  • Human-annotated: ~600 atomic claims

  • Annotator agreement (Cohen's kappa): 0.84 (see the sketch after this list)
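
For reference, agreement figures of this kind can be computed with scikit-learn's cohen_kappa_score; the label arrays below are illustrative, not the actual annotations:

```python
# Illustrative Cohen's kappa computation for two annotators' claim labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Supported", "Contradicted", "Absent", "Supported", "Contradicted"]
annotator_b = ["Supported", "Contradicted", "Supported", "Supported", "Contradicted"]
print(cohen_kappa_score(annotator_a, annotator_b))  # agreement beyond chance
```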

4.2 FinReportQA

  • Generated QA pairs from SEC filings (10-Ks, earnings calls)

  • Questions answered by LLMs

  • Reference: original report section

  • ~1,200 claims, annotated for hallucination subtype



5. Experiments

5.1 Model Comparisons

Model                | Precision | Recall | F1
Vectara HHEM         | 0.63      | 0.35   | 0.45
RefChecker (Claude)  | 0.69      | 0.71   | 0.70
HalluFin w/o CoT     | 0.76      | 0.73   | 0.74
HalluFin w/ CoT      | 0.90      | 0.79   | 0.82

5.2 Subtype Detection Performance

Subtype                      | Precision | Recall | F1
Projection exaggeration      | 0.83      | 0.75   | 0.79
Regulatory misstatement      | 0.89      | 0.81   | 0.85
Market misattribution        | 0.70      | 0.58   | 0.63
Temporal drift               | 0.78      | 0.82   | 0.80
Over-attribution to entities | 0.65      | 0.55   | 0.59

5.3 Latency vs Accuracy Tradeoff



Evaluation mode           | Latency (s) | F1
One-claim eval, no CoT    | 25.6        | 0.74
One-claim eval, with CoT  | 49.3        | 0.82
All-claims eval, with CoT | 14.1        | 0.76
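
The tradeoff above reflects how claims are routed to the verifier: one LLM call per claim versus a single call covering every claim from a response. A rough sketch of the two modes, again using a hypothetical call_llm helper:

```python
# Two evaluation modes behind the latency tradeoff (illustrative only).
def evaluate_per_claim(claims, reference, call_llm):
    """One verifier call per claim: highest accuracy, highest latency."""
    return [call_llm(f"Reference:\n{reference}\n\nVerify this claim: {c}") for c in claims]


def evaluate_all_claims(claims, reference, call_llm):
    """Single verifier call covering every claim: faster, slightly less accurate."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    return call_llm(f"Reference:\n{reference}\n\nVerify each claim below:\n{numbered}")
```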



6. Case Studies: Real-World Hallucination Detection with Source-Backed Claims

We now demonstrate HalluFin on key finance-related LLM outputs, verifying each factual claim using reputable sources like SEC filings, company earnings transcripts, Bloomberg, Reuters, and financial statements.


6.1 Tesla Q4 2024 Earnings Summary Audit

LLM Output:

“Tesla’s Q4 revenue surpassed $35 billion, driven by its dominance in China’s autonomous vehicle market and unprecedented growth in battery technology adoption across Asia.”

Verified Reference:

  • Tesla Q4 2024 earnings: $25.17 billion in revenue (Tesla Investor Relations).

  • Tesla’s China AV market share is growing but does not dominate; Baidu, XPeng, and AutoX lead autonomous pilots in China (Reuters, Dec 2024).

  • No earnings reference was made to battery adoption in Asia.

HalluFin Analysis:

Claim                       | Classification | Source Reference
Revenue surpassed $35B      | Contradicted   | Tesla Q4 ER ($25.17B)
Dominance in China AV       | Contradicted   | Reuters, Dec 2024
Battery tech growth in Asia | Absent         | No mention in ER
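
For downstream audit trails, each row of such an analysis can be emitted as a structured record. The schema below is a hypothetical serialization for illustration (including the subtype label), not a format defined by HalluFin:

```python
# Hypothetical per-claim audit record for the Tesla case study (illustrative schema).
import json

record = {
    "response_id": "tesla-q4-2024-summary",
    "claim": "Tesla's Q4 revenue surpassed $35 billion.",
    "classification": "Contradicted",
    "subtype": "Numerical inconsistency",  # subtype chosen here for illustration only
    "source_reference": "Tesla Q4 2024 earnings release ($25.17B reported)",
}
print(json.dumps(record, indent=2))
```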

6.2 JPMorgan 2023 ESG Performance Summary

LLM Output:

“JPMorgan’s ESG compliance in 2023 was praised by the SEC, and it led the US banking sector in renewable bond issuance, doubling its previous record.”

Verified Reference:

  • The SEC does not publicly issue praise to specific firms for ESG compliance (SEC website).

  • JPMorgan participated in green bond issuance but was not the leader—Goldman Sachs led ESG debt underwriting in 2023 (BloombergNEF).

  • JPMorgan's own 2023 ESG Report does not claim a 2x record.


HalluFin Analysis:

Claim

Classification

Source Reference

SEC praised ESG efforts

Contradicted

SEC Public Statements

Led green bond issuance

Contradicted

BloombergNEF ESG rankings 2023

Doubled previous record

Unevaluable

Not reported in JPM ESG Report

6.3 Nvidia Generative AI Hardware Report


LLM Output:

“Nvidia’s report confirms its generative AI chips power 90% of global AI workloads.”

Verified Reference:

  • Nvidia's Q3 2024 investor materials contain no such claim (Nvidia Q3 2024 IR).

  • Independent estimates of AI workload share do not support a 90% figure (IDC 2024 AI Tracker).

HalluFin Analysis:

Claim                     | Classification | Source Reference
90% workload coverage     | Contradicted   | IDC 2024 AI Tracker
Claim confirmed in report | Contradicted   | Nvidia Q3 2024 IR


6.4 Biotech ETF Commentary (IBB)


LLM Output:

“The iShares Biotech ETF (IBB) is projected to grow 45% YoY in 2025, due to weight-loss drug approvals and WHO-backed clinical trials.”

Verified Reference:

  • No official projection from BlackRock or analysts supports a 45% YoY rise (BlackRock IBB Factsheet).

  • Weight-loss drug approvals are granted by the FDA; the WHO does not fund clinical trials and only issues guidelines (WHO 2024 Clinical Funding Policy).

  • Ozempic/Wegovy approvals drive biotech sentiment but not specific ETF forecasts.


HalluFin Analysis:

Claim                  | Classification      | Source Reference
45% YoY projection     | Unevaluatable       | No published forecast
WHO funding trials     | Contradicted        | WHO funding guidelines
Weight-loss drug boost | Partially Supported | Biotech investor commentary (Bloomberg, Oct 2024)

6.5 Apple AVP Sales & Stock Reaction (Assistant Chat)


LLM Output:

“Apple’s stock rose due to Apple Vision Pro (AVP) record-breaking sales of 5 million units in its first week.”

Verified Reference:

  • Apple has not disclosed AVP sales (Apple Q1 2025 earnings).

  • Third-party estimates from TF Securities: ~200k units sold in first month (MacRumors, Jan 2025).

  • Stock did rise post-launch, but analysts attributed it to services revenue and iPhone growth.


HalluFin Analysis:

Claim                 | Classification      | Source Reference
5M AVP units sold     | Contradicted        | TF Securities / MacRumors
Stock rise due to AVP | Partially Supported | Morgan Stanley Apple Q1 note

Summary Table: Hallucination Detection with Source Attribution

Case             | Key Hallucination          | Classification | Source
Tesla Q4         | Revenue + China dominance  | Contradicted   | Tesla IR, Reuters
JPMorgan ESG     | SEC praise + issuance lead | Contradicted   | SEC, BloombergNEF
Nvidia AI Chips  | 90% workload claim         | Contradicted   | IDC, Nvidia
Biotech ETF      | WHO funds trials           | Contradicted   | WHO policy
Apple Vision Pro | 5M units claim             | Contradicted   | MacRumors, TF Securities





7. Conclusion and Future Work

HalluFin demonstrates that financial hallucination detection requires domain-specific prompts, CoT reasoning, and subtype annotation. It substantially outperforms existing baselines in accuracy while offering the interpretability and granularity necessary for enterprise adoption.

Future directions include:

  • Incorporating tabular hallucination detection (e.g., financial ratios)

  • Integrating into real-time financial assistant products

  • Evaluating mitigation strategies, e.g., reranking generations by hallucination likelihood (a reranking sketch follows this list)
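
As a sketch of the last direction, candidate generations could be reranked by a response-level hallucination score before one is shown to the user; the score_fn below is assumed to come from a pipeline such as HalluFin's HLR:

```python
# Reranking candidate generations by estimated hallucination rate (illustrative).
def rerank_by_hallucination(candidates, reference, score_fn):
    """Order candidate texts by ascending hallucination likelihood.

    `score_fn(text, reference) -> float` is assumed to return a response-level
    hallucination rate such as HLR from the HalluFin pipeline.
    """
    return sorted(candidates, key=lambda text: score_fn(text, reference))
```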





References

  • Akbar, S. A., Hossain, M. M., Wood, T., Chin, S.-C., Salinas, E., Alvarez, V., & Cornejo, E. (2024). HalluMeasure: Fine-grained Hallucination Measurement Using Chain-of-Thought Reasoning. Proceedings of EMNLP 2024, 15020–15037.

  • Chern et al. (2023). FacTool: Factuality Detection in Generative AI. arXiv:2307.13528.

  • Gabriel et al. (2021). GO FIGURE: A Meta-Evaluation of Factuality in Summarization. arXiv:2010.12834.

  • Gekhman et al. (2023). TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. arXiv:2305.11171.

  • Honovich et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. arXiv:2204.04991.

  • Huang et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.

  • Kryściński et al. (2020). Evaluating the Factual Consistency of Abstractive Text Summarization. EMNLP 2020; arXiv:1910.12840.

  • Lebret et al. (2016). Neural Text Generation from Structured Data with Application to the Biography Domain (WikiBio). EMNLP 2016.

  • Li et al. (2023). HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv:2305.11747.

  • Lin et al. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022; arXiv:2109.07958.

  • Manakul et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. arXiv:2303.08896.

  • Min et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.

  • Mündler et al. (2024). Self-Contradictory Hallucinations of Large Language Models. arXiv:2305.15852.

  • QFINTEC Research Labs. (2025). Internal LLM Evaluation Logs.

  • Rawte et al. (2023). The Troubling Emergence of Hallucination in Large Language Models. arXiv:2310.04988.

  • See et al. (2017). Get To The Point: Summarization with Pointer-Generator Networks (CNN/DailyMail). ACL 2017.

  • Tam et al. (2023). Evaluating Factual Consistency. arXiv:2211.08412.

  • Tang et al. (2023, 2024). Hallucination Error Taxonomy. arXiv:2205.12854.

  • Wang et al. (2024). Factuality of LLMs in 2024. arXiv:2402.02420.

  • Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022; arXiv:2201.11903.

  • Zha et al. (2023). AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. ACL 2023.

  • Zhu et al. (2023). Fine-grained Factual Errors. arXiv:2305.16548.

 
 
 
