Modelling Deception Detection in Text
Deception Detection , Fraud , Text mining , Email , Enron
As organizations and government agencies work diligently to detect financial irregularities, malfeasance, fraud and criminal activities through intercepted communication, there is an increasing interest in devising an automated model/tool for deception detection. We build on Pennebaker's empirical model which suggests that deception in text leaves a linguistic signature characterised by changes in frequency of four categories of words: first-person pronouns, exclusive words, negative emotion words, and action words. By applying the model to the Enron email dataset and using an unsupervised matrix-decomposition technique, we explore the differential use of these cue-words/categories in deception detection. Instead of focusing on the predictive power of the individual cue-words, we construct a descriptive model which helps us to understand the multivariate profile of deception based on several linguistic dimensions and highlights the qualitative differences between deceptive and truthful communication. This descriptive model can not only help detect unusual and deceptive communication, but also possibly rank messages along a scale of relative deceptiveness (for instance from strategic negotiation and spin to deception and blatant lying). The model is unintrusive, requires minimal human intervention and, by following the defined pre-processing steps it may be applied to new datasets from different domains.