Revisiting Fraud Detection From The Language Of Financial Reports
Authors
Velloor Sivasubramanian, Sachin
Language
eng
Keyword
Machine Learning, Deep Learning, Natural Language Processing, Financial Statements, Financial Statement Fraud, Data Analytics, Text Mining, Fraud Detection
Abstract
Financial statement fraud has well-documented adverse effects on investors and the broader economy. The seriousness of this issue has emphasized the need for advanced detection methods, and one promising approach is to use machine learning to analyze qualitative disclosures in corporate financial reports, particularly the Management's Discussion and Analysis (MD&A) section. This research investigates the natural language of the MD&A section to detect fraud, approaching the objective from two angles. The first research question tests whether deep learning techniques are more effective than established machine learning methods for detecting fraud within the MD&A section. To answer it, multiple text classification experiments were conducted, each centered on a distinct neural network architecture. The best performance was obtained by a Transformer model that achieved a new state-of-the-art F1-score of 77% and an accuracy of 91%, but its results showed no substantial improvement over those of established machine learning methods, suggesting that deep learning may not be the superior solution it is expected to be for this problem. The second research question addresses a significant knowledge gap: how perpetrators of financial statement fraud might unknowingly reveal evidence of their actions within the MD&A section, even though the section is written collaboratively by multiple participants. To answer it, machine learning and deep learning techniques were employed to reverse engineer the MD&A section's language at three structural levels: word, sentence, and document. The word-level analysis explored whether the frequency of specific words correlated with fraudulent behavior in the MD&A section. The sentence-level analysis, which incorporated a custom data labeling mechanism, examined whether the usage of specific sentences or sentence structures was more likely to be associated with fraudulent behavior. Lastly, the document-level analysis investigated whether broader, aggregate factors were responsible. All experiments yielded strong performances; however, the most plausible explanation came from a stacked predictor at the document level that utilized a meta-like set of attributes. It revealed that a perpetrator could unintentionally disclose fraud by altering just a handful of sentences, by omitting information and thus producing shorter MD&As, by using a certain number of high-risk words that reflect lower coverage of topics, and by positioning such sentence alterations randomly within the MD&A section.
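As an illustration only, the document-level attributes named in the abstract (number of altered sentences, MD&A length, high-risk word count, and the positions of altered sentences) could be computed along the following lines. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `document_meta_features`, the per-sentence `flagged` labels, the `high_risk_words` set, and the use of a standard deviation as a measure of positional spread are all hypothetical choices made for this example.

```python
import re
from statistics import pstdev


def document_meta_features(sentences, flagged, high_risk_words):
    """Hypothetical document-level attributes for one MD&A section.

    sentences:       list of sentence strings
    flagged:         list of bools, True where an (assumed) sentence-level
                     model marked the sentence as suspicious
    high_risk_words: set of lowercase words assumed to be high-risk
    """
    if len(sentences) != len(flagged):
        raise ValueError("sentences and flagged must have the same length")

    # Lowercase word tokens across the whole document.
    tokens = [w for s in sentences for w in re.findall(r"[a-z']+", s.lower())]

    # Relative position (0.0 = first sentence, 1.0 = last) of each flagged sentence.
    n = len(sentences)
    positions = [i / (n - 1) for i, f in enumerate(flagged) if f] if n > 1 else []

    return {
        "n_sentences": n,
        "n_words": len(tokens),
        "n_flagged_sentences": sum(flagged),
        "n_high_risk_words": sum(t in high_risk_words for t in tokens),
        # Spread of flagged positions; 0.0 when fewer than two are flagged.
        "flagged_position_spread": pstdev(positions) if len(positions) > 1 else 0.0,
    }


# Toy input, invented for illustration; it is not data from the study.
example_sentences = [
    "Revenue grew due to strong demand.",
    "We restated certain prior estimates.",
    "Liquidity remains adequate.",
    "An investigation into the restated amounts is ongoing.",
]
example_flags = [False, True, False, True]
example_risk_words = {"restated", "investigation"}

features = document_meta_features(example_sentences, example_flags, example_risk_words)
print(features)
```

A feature dictionary of this kind could then be combined with the outputs of word-level and sentence-level models as input to a second-stage classifier, which is the general idea behind a stacked predictor; the abstract does not specify the actual attributes or base models used.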