dc.contributor.author: Thomas, Stephen
dc.contributor.other: Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.)) [en]
dc.date: 2012-12-12 08:54:12.654 [en]
dc.date: 2012-12-12 12:34:59.854 [en]
dc.date.accessioned: 2012-12-12T19:20:03Z
dc.date.available: 2012-12-12T19:20:03Z
dc.date.issued: 2012-12-12
dc.identifier.uri: http://hdl.handle.net/1974/7688
dc.description: Thesis (Ph.D, Computing) -- Queen's University, 2012-12-12 12:34:59.854 [en]
dc.description.abstract: Mining Software Repositories, which is the process of analyzing the data related to software development practices, is an emerging field that aims to aid development teams in their day-to-day tasks. However, data in many software repositories is currently unused because the data is unstructured, and therefore difficult to mine and analyze. Information Retrieval (IR) techniques, which were developed specifically to handle unstructured data, have recently been used by researchers to mine and analyze the unstructured data in software repositories, with some success. The main contribution of this thesis is the idea that the research and practice of using IR models to mine unstructured software repositories can be improved by going beyond the current state of affairs. First, we propose new applications of IR models to existing software engineering tasks. Specifically, we present a technique to prioritize test cases based on their IR similarity, giving highest priority to those test cases that are most dissimilar. In another new application of IR models, we empirically recover how developers use their mailing list while developing software. Next, we show how the use of advanced IR techniques can improve results. Using a framework for combining disparate IR models, we find that bug localization performance can be improved by 14–56% on average, compared to the best individual IR model. In addition, by using topic evolution models on the history of source code, we can uncover the evolution of source code concepts with an accuracy of 87–89%. Finally, we show the risks of current research, which uses IR models as black boxes without fully understanding their assumptions and parameters. We show that data duplication in source code has undesirable effects for IR models, and that by eliminating the duplication, the accuracy of IR models improves.
Additionally, we find that in the bug localization task, an unwise choice of parameter values results in an accuracy of only 1%, whereas optimal parameters can achieve an accuracy of 55%. Through empirical case studies on real-world systems, we show that all of our proposed techniques and methodologies significantly improve the state of the art. [en_US]
dc.language: en [en]
dc.language.iso: en [en_US]
dc.relation.ispartofseries: Canadian theses [en]
dc.rights: This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner. [en]
dc.subject: empirical studies [en_US]
dc.subject: mining software repositories [en_US]
dc.subject: data mining [en_US]
dc.subject: machine learning [en_US]
dc.subject: software engineering [en_US]
dc.subject: information retrieval [en_US]
dc.title: MINING UNSTRUCTURED SOFTWARE REPOSITORIES USING IR MODELS [en_US]
dc.type: thesis [en_US]
dc.description.degree: Ph.D [en]
dc.contributor.supervisor: Hassan, Ahmed E. [en]
dc.contributor.supervisor: Blostein, Dorothea [en]
dc.contributor.department: Computing [en]

