Software Defect Prediction Using Rich Contextualized Language Use Vectors
Context. Software defect prediction aims to find defect prone source code, and thus reduce the effort, time and cost involved with ensuring the quality of software systems. Both code and non-code metrics are commonly used in this process to train machine learning algorithms to predict software defects. Studies have shown that such metrics-based approaches are failing to give satisfactory results, and have reached a performance ceiling. This thesis explores the idea of using code profiles as an alternative to traditional metrics to predict software defects. This code profile-based method proves to be more promising than traditional metrics-based approaches. Aims. This thesis aims to improve software defect prediction using code profiles as feature variables in place of traditional metrics. Software code profiles encode the density of language feature use and the context of such use in Rich Contextualized Language Use Vectors (RCLUVs) by analysing the parse tree of the source code. This thesis explores whether code profiles can be used to train machine learning algorithms, and compares the performance of the derived models to traditional metrics-based approaches. Methods. To achieve these aims the learning curves of several machine learning algorithms are analyzed, and the performance of the derived models are evaluated against traditional metrics-based approaches. Two benchmark bug datasets, the Eclipse bug dataset and the Github bug database, are used to train the models. Results. The learning curves of the models show machine learning algorithms can learn from RCLUV-based code profiles. Performance evaluation against existing metrics-based approaches reveals that the code profile-based approach is more promising than traditional metrics-based approaches. However, the predictive performance of both metrics and code profile-based approaches drops in cross-version predictions. Conclusions. Unlike traditional metrics-based approaches, this thesis uses vectors generated by analyzing language feature use from the parse trees of source code as feature variables to train machine learning algorithms. Experimental results using learning algorithms encourages us to use software code profiles as an alternative to traditional metrics to predict software defects.