UNDERSTANDING THE IMPACT OF EXPERIMENTAL DESIGN CHOICES ON MACHINE LEARNING CLASSIFIERS IN SOFTWARE ANALYTICS


Authors

Rajbahadur, Gopi

Type

thesis

Language

eng

Keyword

Machine learning, Software engineering, Data mining, Software analytics, Mining software repositories, Defect prediction, Explainable machine learning

Abstract

Software analytics is the process of systematically analyzing software engineering related data to generate actionable insights that help software practitioners make data-driven decisions. Machine learning classifiers lie at the heart of these software analytics pipelines and help automate the process of generating insights from large volumes of low-level software engineering data (e.g., static code metrics of software projects). However, the results generated by these classifiers are extremely sensitive to the various experimental design choices (e.g., the choice of feature removal technique) that one makes when constructing a software analytics pipeline. Prior studies explore the impact of only a few experimental design choices on the results of classifiers, and the impact of many other experimental design choices remains unexplored. It is critical to further understand how the various experimental design choices impact the generated insights of a classifier, since such an understanding enables us to ensure the accuracy and validity of the insights generated from a classifier. Therefore, in this PhD thesis, we further our understanding of how several previously unexplored experimental design choices impact the results that are generated by a classifier. Through several case studies on various software analytics datasets and contexts, 1) we find that the common practice of discretizing the dependent feature could be avoided in some cases (where the defective ratio of the dataset is <15%) by using regression-based classifiers; 2) in cases where the discretization of the dependent feature cannot be avoided, we propose a framework that researchers and practitioners can use to mitigate its impact on the generated insights of a classifier; and 3) we find that the interchangeable use of feature importance methods should be avoided, as different feature importance methods produce vastly different interpretations even on the same classifier. Based on these findings, we provide several guidelines for future software analytics studies.
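
The third finding, that different feature importance methods can yield vastly different interpretations of the same classifier, can be illustrated with a minimal sketch. The snippet below is not the thesis's analysis or framework; it assumes a synthetic dataset and scikit-learn defaults, and simply compares a random forest's built-in impurity-based importance against permutation importance computed on the same fitted model.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for low-level software metrics with a binary label
# (purely illustrative; not a software analytics dataset from the thesis).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Method 1: the classifier's built-in (impurity-based) feature importance.
builtin_rank = np.argsort(clf.feature_importances_)[::-1]

# Method 2: permutation importance computed on held-out data.
perm = permutation_importance(clf, X_test, y_test, n_repeats=30,
                              random_state=42)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("Impurity-based ranking:", builtin_rank)
print("Permutation ranking   :", perm_rank)

Running a sketch like this will often show the two rankings disagreeing for at least some features, which is the kind of interchangeability problem the thesis cautions against when deriving insights from a classifier.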

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
Attribution-ShareAlike 3.0 United States
