Measuring Interestingness of Documents Using Variability
Kondi Chandrasekaran, Pradeep Kumar
MetadataShow full item record
The amount of data we are dealing with is being generated at an astronomical pace. With the rapid technological advances in the field of data storage techniques, storing and transmitting copious amounts of data has become very easy and hassle-free. However, exploring those abundant data and finding the interesting ones has always been a huge integral challenge and cumbersome process to people in all industrial sectors. A model to rank data by interest will help in saving the time spent on the large amount of data. In this research we concentrate specifically on ranking the text documents in corpora according to ``interestingness'' We design a state-of-the-art empirical model to rank documents according to ``interestingness''. The model is cost-efficient, fast and automated to an extent which requires minimal human intervention. We identify different categories of documents based on the word-usage pattern which in turn classifies them as being interesting, mundane or anomalous documents. The model is a novel approach which does not depend on the semantics of the words used in the document but is based on the repetition of words and rate of introduction of new words in the document. The model is a generic design which can be applied to a document corpus of any size from any domain. The model can be used to rank new documents introduced into the corpus. We formulate a couple of normalization techniques which can be used to neutralize the impact of variable document length. We use three approaches, namely dictionary-based data compression, analysis of the rate of new word occurrences and Singular Value Decomposition (SVD). To test the model we use a variety of corpora namely: US Diplomatic Cable releases by Wikileaks, US Presidents State of Union Addresses, Open American National Corpus and 20 Newsgroups articles. The techniques have various pre-processing steps which are totally automated. We compare the results of the three techniques and examine the level of agreement between pair of techniques using a statistical method called the Jaccard coefficient. This approach can also be used to detect the unusual and anomalous documents within the corpus. The results also contradict the assumptions made by Simon and Yule in deriving an equation for a general text generation model.