Leveraging Historical Code Changes to Support Clone Management Activities
Abstract
Code clones are code snippets that come into existence when developers copy and paste (and possibly modify) an existing piece of code. Studies show that cloning is an inevitable phenomenon leading to a significant presence of code clones (as much as 10%-30% of the source code consists of cloned code) in large software systems. To effectively manage these clones, researchers have proposed multiple activities along two dimensions: 1) proactive clone management, and 2) post-mortem clone management. Proactive clone management emphasizes activities that prevent the introduction of new clones into the source code (e.g., identifying the factors that influence developers to clone code). On the other hand, post-mortem clone management focuses on managing the existing
clones (e.g., detection or refactoring of a clone). In this thesis, we examine several open issues along both dimensions of clone management activities. For example, over 80% of research focuses on detecting clones and studying their impact on code quality. However, little research has examined the factors that make code more likely to be cloned. We find that an increase in the complexity
of a method increases the likelihood of code being cloned from that method. Moreover, while there exist more than 70 clone detection tools, few techniques are available to evaluate the performance (especially recall) of these tools. We find that the current state-of-the-art framework for the evaluation of clone detection tools tends to overestimate their recall. Hence, we propose a statistically rigorous framework to evaluate the recall of clone detection tools. In addition, little research has examined the life expectancy of an introduced clone. We find that the characteristics of a clone (e.g., size of a clone, or the directory-structure distance between clone siblings) at the time of its introduction strongly influence its life expectancy. Practitioners and researchers can leverage our findings for: a) effective proactive clone management (e.g., by developing tools to propose the abstraction
of code that is likely to be cloned) and b) effective post-mortem clone management, by using our framework to evaluate the recall of clone detection tools more accurately and by determining whether an introduced clone will be short-lived or long-lived, so that clone management activities (e.g., annotation or refactoring of clones, especially the long-lived ones) can be recommended efficiently.