Empirical Evaluation of Graph-Anonymized Metrics for JIT Defect Prediction
Authors
Malik, Akshat
Date
2024-04-23
Type
thesis
Language
eng
Keyword
Graph Anonymization, Privacy, Software Defect Prediction, Knowledge Graphs
Abstract
Software analytics data, housed in version control and issue report repositories, serves pivotal roles in organizational tasks like predicting code bugs and selecting code reviewers. Yet, individual organizations often lack the breadth of data needed for these analytics, making data sharing essential. However, privacy concerns impede such sharing, as it could expose sensitive organizational information or compromise product quality. Additionally, the risk of reverse-engineering training data from shared models further complicates data privacy.
To address these challenges, anonymization techniques such as MORPH, LACE, and LACE2 have been applied to defect prediction data. While they do provide privacy, they often degrade model performance because they disregard the relationships within the data during anonymization.
To preserve these relationships, we propose graph anonymization techniques. These methods maintain data connections while ensuring privacy. Our research evaluates four specific graph anonymization methods—Random Add/Delete, Random Switch, k-DA, and Generalization—across six large software projects. We assess privacy using the Increased Privacy Ratio (IPR) metric, finding that each method achieves privacy scores above 65%, with Random Add/Delete and Random Switch exceeding 80% for all projects.
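As an illustration of the two random perturbation methods named above, the following is a minimal sketch based on how Random Add/Delete and Random Switch are commonly described in the graph anonymization literature. The abstract does not specify the thesis's exact variants, so the function names, the undirected edge-set representation, and the parameter `k` (number of perturbations) are assumptions made for this example only.

```python
import random
from itertools import combinations


def norm(u, v):
    """Store an undirected edge as a sorted tuple."""
    return (u, v) if u <= v else (v, u)


def random_add_delete(edges, nodes, k, rng):
    """Delete k random existing edges and add k random edges
    that were absent from the original graph (assumed variant)."""
    edges = {norm(u, v) for u, v in edges}
    non_edges = [e for e in combinations(sorted(nodes), 2) if e not in edges]
    removed = rng.sample(sorted(edges), k)
    added = rng.sample(non_edges, k)
    return (edges - set(removed)) | set(added)


def random_switch(edges, k, rng, max_tries=1000):
    """Attempt k switches: replace edges (a, b) and (c, d) with
    (a, d) and (c, b). Each switch keeps every node's degree unchanged."""
    edges = {norm(u, v) for u, v in edges}
    done = tries = 0
    while done < k and tries < max_tries:
        tries += 1
        (a, b), (c, d) = rng.sample(sorted(edges), 2)
        new1, new2 = norm(a, d), norm(c, b)
        # Skip switches that would create a self-loop or a duplicate edge.
        if len({a, b, c, d}) < 4 or new1 in edges or new2 in edges:
            continue
        edges -= {(a, b), (c, d)}
        edges |= {new1, new2}
        done += 1
    return edges


def degrees(edges):
    """Count how many edges touch each node."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Hypothetical toy graph: a ring of six nodes.
ring = {norm(i, (i + 1) % 6) for i in range(6)}
rng = random.Random(0)
perturbed = random_add_delete(ring, range(6), k=2, rng=rng)
switched = random_switch(ring, k=2, rng=rng)
```

In this sketch, Random Add/Delete keeps the total number of edges fixed but changes node degrees, while Random Switch preserves the degree of every node, which is one reason the two methods can differ in how much structure they retain.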
Within-project analyses reveal minimal impacts on model performance, with median reductions of 1.45% in AUC, 5.35% in Recall, 2.29% in G-Mean, and 4.42% in FPR when privacy exceeds 65%. However, when privacy scores surpass 80%, performance declines further. Notably, graph anonymization outperforms tabular methods like MORPH and LACE, which significantly diminish model performance.
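For readers unfamiliar with the metrics reported above, the sketch below computes Recall, FPR, and G-Mean from confusion-matrix counts using their standard definitions, with G-Mean taken as the geometric mean of Recall and (1 - FPR). The counts are hypothetical and are not taken from the thesis. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def defect_metrics(tp, fp, tn, fn):
    """Recall, false positive rate, and G-Mean from confusion-matrix counts.

    tp/fn: defective changes predicted as defective / clean.
    fp/tn: clean changes predicted as defective / clean.
    """
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Assumed definition: geometric mean of recall and specificity (1 - FPR).
    g_mean = math.sqrt(recall * (1 - fpr))
    return {"recall": recall, "fpr": fpr, "g_mean": g_mean}


# Hypothetical counts for illustration only.
metrics = defect_metrics(tp=40, fp=10, tn=90, fn=10)
```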
In cross-project analyses, graph anonymization maintains model performance with marginal reductions in metrics compared to non-graph methods. Combining graph with non-graph anonymization techniques mitigates performance declines observed with non-graph methods alone, highlighting the efficacy of graph anonymization in preserving privacy while sustaining model performance.
Our study shows that graph anonymization can preserve privacy with only small reductions in model performance in both within-project and cross-project settings. Even when combined with other anonymization strategies, graph anonymization techniques sustain model performance, which supports their viability in real-world applications.
License
Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.