HalluFin: Fine-Grained Hallucination Detection in Financial Texts Using Chain-of-Thought Reasoning
- Tarun Bhatia
- Apr 30
- 8 min read
Author: Tarun Bhatia, QFINTEC

Abstract
Large Language Models (LLMs) are increasingly utilized in financial analysis for tasks such as summarization, recommendation generation, and automated report drafting. However, their propensity to hallucinate—producing plausible yet factually incorrect statements—poses significant risks in high-stakes financial environments. This paper introduces HalluFin, a fine-grained hallucination detection system tailored to the financial domain. Building upon the HalluMeasure framework, originally designed for general summarization tasks, HalluFin integrates domain-specific datasets, financial ontologies, and expert-in-the-loop evaluations. Utilizing chain-of-thought (CoT) prompting, HalluFin achieves accurate claim verification and subtype error classification. Experiments on two curated datasets—FinNewsSumm and FinReportQA—demonstrate that HalluFin outperforms baseline models by over 12 points in F1 score, offering actionable insights on hallucination types unique to financial documents, such as "Temporal drift," "Over-attribution to entities," and "Projection-based overgeneralization."
1. Introduction
The integration of LLMs like GPT-4, Claude, and Mistral into financial workflows—from earnings summaries to portfolio commentaries—has introduced efficiency gains. However, the inherent risk of hallucinations in LLM outputs can lead to regulatory non-compliance, reputational damage, and even inadvertent market manipulation. Unlike consumer applications, where hallucinations may be relatively benign, financial outputs demand verifiability and traceability. This necessitates an automated hallucination detection pipeline that is both fine-grained and interpretable.
2. Related Work
Hallucination in large language models (LLMs) has become a rapidly growing area of research due to its implications for model trustworthiness, particularly in high-stakes domains like healthcare, law, and finance. Hallucination is broadly defined as the generation of content that is unfaithful or not grounded in source data—a critical risk when LLMs are used to analyze or summarize factual documents.
Several survey papers offer a comprehensive overview of hallucination phenomena, categorizing types, causes, and mitigation strategies (Huang et al., 2023; Wang et al., 2024; Rawte et al., 2023). These surveys differentiate between intrinsic hallucinations (contradicting known facts) and extrinsic hallucinations (fabricating plausible but unverifiable information).
Datasets and Benchmarks
A number of works have created hallucination evaluation datasets, including TruthfulQA (Lin et al., 2022), HaluEval (Li et al., 2023), and summarization-focused datasets like TechNewsSumm (Akbar et al., 2024) and WikiBio (Lebret et al., 2016). Many of these target either news domains (e.g., CNN/DailyMail; See et al., 2017) or structured data tasks such as biography generation. However, finance-specific hallucination datasets remain scarce, motivating our development of FinNewsSumm and FinReportQA.
Evaluation Metrics
Various automatic hallucination metrics have been proposed:
Pretrained NLI-based metrics (Kryscinski et al., 2020; Laban et al., 2021) use entailment classification between source and generated text.
QA-based evaluation methods generate questions from model outputs and attempt to answer them from the reference (Chern et al., 2023; Min et al., 2023).
Consistency scoring methods like AlignScore (Zha et al., 2023) and Vectara-HHEM rely on embedding similarity and trained alignment classifiers.
More recent approaches like RefChecker (Hu et al., 2023) and SelfCheckGPT (Manakul et al., 2023) use LLMs themselves to validate model outputs.
Meta-evaluation frameworks like TRUE (Honovich et al., 2022) and GoFigure (Gabriel et al., 2021) compare these metrics' correlation with human annotations.
Granularity of Detection
Hallucination evaluation approaches differ in granularity:
Response-level detection is efficient but coarse (e.g., Vectara-HHEM, AlignScore).
Sentence-level classification is more precise (Kryscinski et al., 2020).
Claim-level analysis, as introduced by Min et al. (2023) and expanded in HalluMeasure (Akbar et al., 2024), allows for fine-grained identification of factual inconsistencies. Our work builds on this by introducing domain-specific subtypes for financial texts (e.g., regulatory misstatement, unsupported causality).
Hallucination Typologies
Several efforts have categorized hallucinations into error taxonomies:
Rawte et al. (2023) outline types like negation, entity swap, or exaggerated reasoning.
Tang et al. (2023, 2024) and Zhu et al. (2023) propose fine-grained typologies across different domains, including dialogue and summarization.
Our work extends this line of research by proposing ten hallucination subtypes specific to finance, such as projection exaggeration and temporal drift, which we evaluate in real-world applications like earnings summaries and ETF commentary.
Chain-of-Thought Reasoning for Hallucination Detection
Chain-of-Thought (CoT) prompting (Wei et al., 2022) has shown significant improvements on reasoning-heavy tasks and is now being applied to hallucination detection, where step-by-step reasoning over domain-specific prompts improves both accuracy and explainability.
Recent advancements such as HalluMeasure (Akbar et al., 2024) have introduced CoT-based approaches that decompose responses into atomic claims, followed by hallucination classification and subtype labeling. However, financial hallucinations often involve complex entity attributions, regulatory terminology, and predictive extrapolations. Existing tools like RefChecker and Vectara HHEM lack domain specialization, underscoring the need for a finance-specific solution.
3. HalluFin Framework
3.1 Claim Extraction
HalluFin employs a fine-tuned version of Claude 2.1 to extract atomic financial claims (Xiaxoue et al.) using few-shot CoT prompting. Training data includes:
10-K and 10-Q reports
Analyst summaries
Financial news articles
Example:
Response: "Tesla’s Q1 earnings surged due to EV tax credits and strong FSD revenue in China."
Extracted Claims:
1. Tesla’s Q1 earnings surged.
2. The surge was due to EV tax credits.
3. FSD revenue in China contributed significantly to the surge.
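To make this extraction step concrete, the minimal Python sketch below assembles a few-shot CoT prompt and parses the enumerated claims out of the reply. The `call_llm` helper is a hypothetical stand-in for whatever Claude client is used; the prompt wording and parsing are illustrative assumptions, not HalluFin's actual implementation.

```python
from typing import Callable, List

# Few-shot exemplar mirroring the example above; wording is illustrative.
FEW_SHOT_EXAMPLE = """\
Response: "Tesla's Q1 earnings surged due to EV tax credits and strong FSD revenue in China."
Reasoning: The sentence bundles one event with two causal attributions.
Claims:
1. Tesla's Q1 earnings surged.
2. The surge was due to EV tax credits.
3. FSD revenue in China contributed significantly to the surge.
"""

def extract_claims(response_text: str, call_llm: Callable[[str], str]) -> List[str]:
    """Decompose an LLM response into atomic financial claims via few-shot CoT."""
    prompt = (
        "You decompose financial statements into atomic, independently verifiable claims. "
        "Reason step by step, then enumerate the claims.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f'Response: "{response_text}"\n'
        "Reasoning:"
    )
    raw = call_llm(prompt)
    # Keep only the numbered claim lines from the model's reply.
    claims = []
    for line in raw.splitlines():
        head, sep, tail = line.strip().partition(". ")
        if sep and head.isdigit():
            claims.append(tail.strip())
    return claims
```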
3.2 Claim Classification
Each claim is categorized as:
Supported
Partially Supported
Absent
Contradicted
Unevaluatable
Additionally, HalluFin identifies 10 hallucination subtypes adapted for finance:
Temporal drift (e.g., projecting future trends without grounding)
Projection exaggeration
Regulatory misstatement
Market misattribution
Unsupported causality
Over-attribution to entities
Data fabrication
Numerical inconsistency
Misleading comparisons
Context omission
Classification is performed using Claude 3 Sonnet and Mistral Large, both with and without CoT prompting.
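A minimal sketch of this classification step, assuming the same generic `call_llm` stand-in as above, is shown below. The label names follow Section 3.2; the prompt text and answer format are illustrative assumptions rather than the production prompt.

```python
from typing import Callable, Dict

CATEGORIES = ["Supported", "Partially Supported", "Absent", "Contradicted", "Unevaluatable"]
SUBTYPES = [
    "Temporal drift", "Projection exaggeration", "Regulatory misstatement",
    "Market misattribution", "Unsupported causality", "Over-attribution to entities",
    "Data fabrication", "Numerical inconsistency", "Misleading comparisons",
    "Context omission",
]

def classify_claim(claim: str, reference: str, call_llm: Callable[[str], str],
                   use_cot: bool = True) -> Dict[str, str]:
    """Label one claim against the reference document; pick a subtype if hallucinated."""
    cot_step = ("Quote the relevant reference passage and reason step by step "
                "before answering.\n") if use_cot else ""
    prompt = (
        f"Reference document:\n{reference}\n\n"
        f"Claim: {claim}\n\n"
        f"{cot_step}"
        f"Classify the claim as one of {CATEGORIES}. "
        f"If it is not Supported, also choose a subtype from {SUBTYPES}.\n"
        "Answer on the final line exactly as: LABEL | SUBTYPE (or None)."
    )
    final_line = call_llm(prompt).strip().splitlines()[-1]
    label, _, subtype = (part.strip() for part in final_line.partition("|"))
    return {"claim": claim, "label": label, "subtype": subtype or "None"}
```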
3.3 Aggregation and Scoring
HalluFin computes the following metrics (a minimal scoring sketch follows this list):
Response-Level Hallucination Rate (HLR)
Claim-Level Hallucination F1 Score
Subtype Distribution Heatmaps
Severity Scores based on financial impact heuristics
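The sketch below shows how these aggregates could be computed from per-claim labels. The hallucination definition (Absent or Contradicted) and the severity weights are assumptions for illustration; HalluFin's actual financial-impact heuristics are not reproduced here.

```python
from collections import Counter
from typing import Dict, List

HALLUCINATED = {"Absent", "Contradicted"}          # assumed definition of a hallucinated claim
SEVERITY_WEIGHTS: Dict[str, float] = {             # hypothetical financial-impact heuristics
    "Numerical inconsistency": 1.0,
    "Regulatory misstatement": 1.0,
    "Data fabrication": 0.9,
    "Projection exaggeration": 0.6,
    "Context omission": 0.4,
}

def response_hallucination_rate(labels: List[str]) -> float:
    """HLR: share of a response's claims judged hallucinated."""
    return sum(l in HALLUCINATED for l in labels) / max(len(labels), 1)

def claim_level_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 of flagging hallucinated claims against human annotations."""
    pred = [l in HALLUCINATED for l in predicted]
    true = [l in HALLUCINATED for l in gold]
    tp = sum(p and t for p, t in zip(pred, true))
    fp = sum(p and not t for p, t in zip(pred, true))
    fn = sum(t and not p for p, t in zip(pred, true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def subtype_distribution(subtypes: List[str]) -> Counter:
    """Counts that feed the subtype distribution heatmaps."""
    return Counter(s for s in subtypes if s and s != "None")

def severity_score(subtypes: List[str]) -> float:
    """Heuristic severity: sum of impact weights over detected subtypes."""
    return sum(SEVERITY_WEIGHTS.get(s, 0.5) for s in subtypes if s and s != "None")
```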
4. Datasets
4.1 FinNewsSumm
100 finance-focused news articles from Bloomberg and Reuters
Each summary is LLM-generated (GPT-4, Claude, Mistral)
Human-annotated: ~600 atomic claims
Annotator agreement (Cohen's Kappa): 0.84
4.2 FinReportQA
Generated QA pairs from SEC filings (10-Ks, earnings calls)
Questions answered by LLMs
Reference: original report section
~1,200 claims, annotated for hallucination subtype (an illustrative record format is sketched below)
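For concreteness, a single annotated record in either dataset might look like the following. The field names and values are illustrative assumptions, not the released schema.

```python
# Hypothetical shape of one annotated record in FinNewsSumm / FinReportQA.
example_record = {
    "dataset": "FinReportQA",
    "source": "Excerpt from a 10-K filing or earnings-call transcript",
    "question": "What drove revenue growth in Q4?",
    "llm_output": "Revenue grew 18% on data-center demand and a one-time tax benefit.",
    "generator": "gpt-4",                       # GPT-4, Claude, or Mistral
    "claims": [
        {
            "text": "Revenue grew 18% in Q4.",
            "label": "Supported",               # Section 3.2 category
            "subtype": None,
            "evidence": "Q4 revenue increased 18% year over year.",
        },
        {
            "text": "Growth was driven by a one-time tax benefit.",
            "label": "Absent",
            "subtype": "Unsupported causality",
            "evidence": None,
        },
    ],
}
```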
5. Experiments
5.1 Model Comparisons
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Vectara HHEM | 0.63 | 0.35 | 0.45 |
| RefChecker (Claude) | 0.69 | 0.71 | 0.70 |
| HalluFin w/o CoT | 0.76 | 0.73 | 0.74 |
| HalluFin w/ CoT | 0.90 | 0.79 | 0.82 |
5.2 Subtype Detection Performance
| Subtype | Precision | Recall | F1 |
|---|---|---|---|
| Projection exaggeration | 0.83 | 0.75 | 0.79 |
| Regulatory misstatement | 0.89 | 0.81 | 0.85 |
| Market misattribution | 0.70 | 0.58 | 0.63 |
| Temporal drift | 0.78 | 0.82 | 0.80 |
| Over-attribution to entities | 0.65 | 0.55 | 0.59 |
5.3 Latency vs Accuracy Tradeoff
| Evaluation mode | Latency (s) | F1 |
|---|---|---|
| One-claim eval, no CoT | 25.6 | 0.74 |
| One-claim eval, with CoT | 49.3 | 0.82 |
| All-claims eval, with CoT | 14.1 | 0.76 |
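The two evaluation modes behind this table can be sketched as follows: per-claim evaluation issues one LLM call per claim and is slower but more accurate, while all-claims evaluation batches every claim into a single prompt and trades a few F1 points for much lower latency. The prompts and the `call_llm` helper are illustrative assumptions, as in the earlier sketches.

```python
from typing import Callable, List

def evaluate_per_claim(claims: List[str], reference: str,
                       call_llm: Callable[[str], str]) -> List[str]:
    """One LLM call per claim (the 'one-claim eval' rows): slower, higher F1."""
    verdicts = []
    for claim in claims:
        prompt = (f"Reference document:\n{reference}\n\nClaim: {claim}\n\n"
                  "Reason step by step, then answer on the final line as: LABEL | SUBTYPE.")
        verdicts.append(call_llm(prompt).strip().splitlines()[-1])
    return verdicts

def evaluate_all_claims(claims: List[str], reference: str,
                        call_llm: Callable[[str], str]) -> str:
    """Single LLM call covering every claim (the 'all-claims eval' row): faster, lower F1."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    prompt = (
        f"Reference document:\n{reference}\n\n"
        f"Claims:\n{numbered}\n\n"
        "For each claim, reason briefly, then output one line per claim as: "
        "INDEX | LABEL | SUBTYPE (or None)."
    )
    return call_llm(prompt)
```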
6. Case Studies: Real-World Hallucination Detection with Source-Backed Claims
We now demonstrate HalluFin on key finance-related LLM outputs, verifying each factual claim using reputable sources like SEC filings, company earnings transcripts, Bloomberg, Reuters, and financial statements.
6.1 Tesla Q4 2024 Earnings Summary Audit
LLM Output:
“Tesla’s Q4 revenue surpassed $35 billion, driven by its dominance in China’s autonomous vehicle market and unprecedented growth in battery technology adoption across Asia.”
Verified Reference:
Tesla Q4 2024 earnings: $25.17 billion in revenue (Tesla Investor Relations).
Tesla’s China AV market share is growing but does not dominate; Baidu, XPeng, and AutoX lead autonomous pilots in China (Reuters, Dec 2024).
No earnings reference was made to battery adoption in Asia.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| Revenue surpassed $35B | Contradicted | Tesla Q4 ER, $25.17B |
| Dominance in China AV | Contradicted | Reuters, Dec 2024 |
| Battery tech growth in Asia | Absent | No mention in ER |
6.2 JPMorgan 2023 ESG Performance Summary
LLM Output:
“JPMorgan’s ESG compliance in 2023 was praised by the SEC, and it led the US banking sector in renewable bond issuance, doubling its previous record.”
Verified Reference:
The SEC does not publicly issue praise to specific firms for ESG compliance (SEC website).
JPMorgan participated in green bond issuance but was not the leader—Goldman Sachs led ESG debt underwriting in 2023 (BloombergNEF).
JPMorgan's own 2023 ESG Report does not claim a 2x record.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| SEC praised ESG efforts | Contradicted | SEC Public Statements |
| Led green bond issuance | Contradicted | BloombergNEF ESG rankings 2023 |
| Doubled previous record | Unevaluatable | Not reported in JPM ESG Report |
6.3 Nvidia Generative AI Hardware Report
LLM Output:
“Nvidia’s report confirms its generative AI chips power 90% of global AI workloads.”
Verified Reference:
No such claim exists in Nvidia’s Q3 2024 earnings or investor deck (Nvidia IR).
IDC reports suggest Nvidia holds roughly 80% of the AI accelerator market, but not 90% of all AI workloads (IDC AI Infrastructure Tracker, 2024).
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| 90% workload coverage | Contradicted | IDC 2024 AI Tracker |
| Claim confirmed in report | Contradicted | Nvidia Q3 2024 IR |
6.4 Biotech ETF Commentary (IBB)
LLM Output:
“The iShares Biotech ETF (IBB) is projected to grow 45% YoY in 2025, due to weight-loss drug approvals and WHO-backed clinical trials.”
Verified Reference:
No official projection from BlackRock or analysts supports a 45% YoY rise (BlackRock IBB Factsheet).
Weight-loss drug approvals are issued by the FDA; the WHO does not fund clinical trials, it only issues guidelines (WHO 2024 Clinical Funding Policy).
Ozempic/Wegovy approvals drive biotech sentiment but not specific ETF forecasts.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| 45% YoY projection | Unevaluatable | No published forecast |
| WHO funding trials | Contradicted | WHO funding guidelines |
| Weight-loss drug boost | Partially Supported | Biotech investor commentary (Bloomberg, Oct 2024) |
6.5 Apple AVP Sales & Stock Reaction (Assistant Chat)
LLM Output:
“Apple’s stock rose due to Apple Vision Pro (AVP) record-breaking sales of 5 million units in its first week.”
Verified Reference:
Apple has not disclosed AVP sales (Apple Q1 2025 earnings).
Third-party estimates from TF Securities: ~200k units sold in the first month (MacRumors, Jan 2025).
The stock did rise post-launch, but analysts attributed the move to services revenue and iPhone growth.
HalluFin Analysis:
| Claim | Classification | Source Reference |
|---|---|---|
| 5M AVP units sold | Contradicted | TF Securities / MacRumors |
| Stock rise due to AVP | Partially Supported | Morgan Stanley Apple Q1 Note |
Summary Table: Hallucination Detection with Source Attribution
| Case | Key Hallucination | Classification | Source |
|---|---|---|---|
| Tesla Q4 | Revenue + China dominance | Contradicted | Tesla IR, Reuters |
| JPMorgan ESG | SEC praise + issuance lead | Contradicted | SEC, BloombergNEF |
| Nvidia AI Chips | 90% claim | Contradicted | IDC, Nvidia |
| Biotech ETF | WHO funds trials | Contradicted | WHO policy |
| Apple Vision Pro | 5M units claim | Contradicted | MacRumors, TF Securities |

7. Conclusion and Future Work
HalluFin demonstrates that financial hallucination detection requires domain-specific prompts, CoT reasoning, and subtype annotation. It significantly outperforms existing benchmarks in accuracy while offering interpretability and granularity necessary for enterprise adoption.
Future directions include:
Incorporating tabular hallucination detection (e.g., financial ratios)
Integrating into real-time financial assistant products
Evaluating mitigation strategies, e.g., reranking generations by hallucination likelihood (a minimal reranking sketch follows)
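As a rough illustration of the reranking idea, the sketch below scores each candidate generation with a response-level hallucination rate built from the earlier `extract_claims` and `classify_claim` sketches and keeps the least hallucinated one. This is an assumption about how such a reranker could be wired, not a shipped feature.

```python
from typing import Callable, List

def rerank_by_hallucination(candidates: List[str], reference: str,
                            call_llm: Callable[[str], str]) -> str:
    """Return the candidate generation with the lowest response-level hallucination rate."""
    def hlr(text: str) -> float:
        claims = extract_claims(text, call_llm)                        # Section 3.1 sketch
        labels = [classify_claim(c, reference, call_llm)["label"]      # Section 3.2 sketch
                  for c in claims]
        hallucinated = {"Absent", "Contradicted"}
        return sum(l in hallucinated for l in labels) / max(len(labels), 1)

    return min(candidates, key=hlr)   # pick the candidate with the lowest HLR
```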
References
Akbar, S. A., Hossain, M. M., Wood, T., Chin, S.-C., Salinas, E., Alvarez, V., & Cornejo, E. (2024). HalluMeasure: Fine-grained Hallucination Measurement Using Chain-of-Thought Reasoning. Proceedings of EMNLP 2024, 15020–15037.
Honovich et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. arXiv:2204.04991.
Zha et al. (2023). AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. ACL 2023.
Min et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.
QFINTEC Research Labs (2025). Internal LLM Evaluation Logs.
Huang et al. (2023). A Survey on Hallucination in Large Language Models. arXiv:2311.05232.
Wang et al. (2024). Factuality of LLMs in 2024. arXiv:2402.02420.
Rawte et al. (2023). The Troubling Emergence of Hallucination in Large Language Models. arXiv:2310.04988.
Li et al. (2023). HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv:2305.11747.
Lin et al. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958.
Tam et al. (2023). Evaluating the Factual Consistency of Large Language Models Through News Summarization. arXiv:2211.08412.
Chern et al. (2023). FacTool: Factuality Detection in Generative AI. arXiv:2307.13528.
Manakul et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. arXiv:2303.08896.
Mündler et al. (2024). Self-Contradictory Hallucinations of Large Language Models. arXiv:2305.15852.
Gekhman et al. (2023). TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. arXiv:2305.11171.
Kryściński et al. (2020). Evaluating the Factual Consistency of Abstractive Text Summarization. EMNLP 2020 (arXiv:1910.12840).
Gabriel et al. (2021). GO FIGURE: A Meta-Evaluation of Factuality in Summarization. arXiv:2010.12834.
See et al. (2017). Get To The Point: Summarization with Pointer-Generator Networks (CNN/DailyMail dataset). ACL 2017.
Lebret et al. (2016). Neural Text Generation from Structured Data (WikiBio dataset). EMNLP 2016.
Tang et al. (2023/2024). Hallucination Error Taxonomy. arXiv:2205.12854.
Zhu et al. (2023). Fine-grained Factual Errors. arXiv:2305.16548.