Software Defect Prediction Using Rich Contextualized Language Use Vectors

Loading...
Thumbnail Image

Authors

Rahman, Ashiqur

Date

Type

thesis

Language

eng

Keyword

Software Code Profile , Rich Contextualized Language Use Vectors , Machine Learning , Deep Learning , Software Bug Prediction , Software Defect Prediction

Research Projects

Organizational Units

Journal Issue

Alternative Title

Abstract

Context. Software defect prediction aims to find defect prone source code, and thus reduce the effort, time and cost involved with ensuring the quality of software systems. Both code and non-code metrics are commonly used in this process to train machine learning algorithms to predict software defects. Studies have shown that such metrics-based approaches are failing to give satisfactory results, and have reached a performance ceiling. This thesis explores the idea of using code profiles as an alternative to traditional metrics to predict software defects. This code profile-based method proves to be more promising than traditional metrics-based approaches. Aims. This thesis aims to improve software defect prediction using code profiles as feature variables in place of traditional metrics. Software code profiles encode the density of language feature use and the context of such use in Rich Contextualized Language Use Vectors (RCLUVs) by analysing the parse tree of the source code. This thesis explores whether code profiles can be used to train machine learning algorithms, and compares the performance of the derived models to traditional metrics-based approaches. Methods. To achieve these aims the learning curves of several machine learning algorithms are analyzed, and the performance of the derived models are evaluated against traditional metrics-based approaches. Two benchmark bug datasets, the Eclipse bug dataset and the Github bug database, are used to train the models. Results. The learning curves of the models show machine learning algorithms can learn from RCLUV-based code profiles. Performance evaluation against existing metrics-based approaches reveals that the code profile-based approach is more promising than traditional metrics-based approaches. However, the predictive performance of both metrics and code profile-based approaches drops in cross-version predictions. Conclusions. Unlike traditional metrics-based approaches, this thesis uses vectors generated by analyzing language feature use from the parse trees of source code as feature variables to train machine learning algorithms. Experimental results using learning algorithms encourages us to use software code profiles as an alternative to traditional metrics to predict software defects.

Description

Citation

Publisher

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.

Journal

Volume

Issue

PubMed ID

External DOI

ISSN

EISSN