Show simple item record

dc.contributor.author: Rakha, Mohamed
dc.contributor.other: Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.)) [en]
dc.description.abstract: Issue tracking systems (ITSs), such as Bugzilla, are commonly used to track reported bugs and change requests. Duplicate reports have been considered a hindrance to developers and a drain on their resources. To avoid wasting developer resources on previously-reported (i.e., duplicate) issues, it is necessary to identify such duplicates as soon as they are reported. In recent years, several approaches have been proposed for the automated retrieval of duplicate reports. These approaches leverage the textual, categorical, and contextual information in previously-reported issues to determine whether a newly-reported issue is a duplicate. In general, studies that are designed to evaluate these approaches treat all duplicate issue reports equally, make use of data chunks that span a relatively short period of time, and ignore the impact of newly-activated features (e.g., just-in-time (JIT) lightweight retrieval of duplicates at filing time) in recent issue tracking systems. This thesis revisits the experimental design choices of such prior studies along three perspectives: 1) the performance measures that are used, 2) the evaluation process, and 3) the choice of experimental data. For the performance measures, we highlight the need for effort-aware evaluation of such approaches, since the identification of a considerable proportion of duplicate reports (over 50%) appears to be a relatively trivial task. For the evaluation process, we show that the previously-reported performance of such approaches is significantly overestimated. Finally, recent versions of ITSs perform just-in-time lightweight retrieval of duplicate issue reports at the filing time of an issue report. The aim of such just-in-time retrieval is to avoid the filing of duplicates. We show that future studies of the automated retrieval of duplicate reports have to focus on after-JIT duplicates, as these duplicates are more representative of issue reports in practice nowadays. Our results highlight the current state of progress in the automated retrieval of duplicate reports while charting directions for future research efforts. [en_US]
dc.relation.ispartofseries: Canadian theses [en]
dc.rights: CC BY 4.0
dc.rights: Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada [en]
dc.rights: ProQuest PhD and Master's Theses International Dissemination Agreement [en]
dc.rights: Intellectual Property Guidelines at Queen's University [en]
dc.rights: Copying and Preserving Your Thesis [en]
dc.rights: This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner. [en]
dc.subject: Text Analysis [en_US]
dc.subject: Duplicate Issue Reports [en_US]
dc.subject: Performance Evaluation [en_US]
dc.subject: Software Engineering [en_US]
dc.subject: Software Issue Reports [en_US]
dc.title: Revisiting the Experimental Design Choices for Approaches for the Automated Retrieval of Duplicate Issue Reports [en_US]
dc.description.degree: Doctor of Philosophy [en_US]
dc.contributor.supervisor: Hassan, Ahmed E.


Except where otherwise noted, this item's license is described as CC BY 4.0