Empirical Studies of Code Clone Genealogies
Barbour, Liliane Jeanne
Two identical or similar code fragments form a clone pair. Previous studies have identified cloning as a risky practice. Therefore, a developer needs to be aware of any clone pairs so as to properly propagate changes between clones. A clone pair experiences many changes during the creation and maintenance of a software system. A change can either maintain or remove the similarity between the clones in a clone pair. If a change maintains the similarity, the clone pair is left in a consistent state. However, if a change makes the clones no longer similar, the clone pair is left in an inconsistent state. The set of states and changes experienced by a clone pair over time forms an evolution history known as a clone genealogy. In this thesis, we provide a formal definition of clone genealogies and perform two case studies to examine them. In the first study, we examine clone genealogies to identify fault-prone “patterns” of states and changes. We also build prediction models using clone metrics from one snapshot and compare them to models that include historical evolutionary information about code clones. We examine three long-lived software systems and identify clones using the Simian and CCFinder clone detection tools. The results show that both the size of a clone and the time interval between changes are related to the fault-proneness of a clone pair. Additionally, we show that adding evolutionary information increases the precision, recall, and F-Measure of fault prediction models by up to 26%. In our second study, we define eight types of late propagation and compare them to other forms of clone evolution. Our results not only verify that late propagation is more harmful to software systems, but also establish that some specific cases of late propagation are more harmful than others.
Specifically, two cases are most risky: (1) when a clone experiences inconsistent changes and then a re-synchronizing change without any modification to the other clone in the clone pair; and (2) when two clones undergo an inconsistent modification followed by a re-synchronizing change that modifies both clones in the clone pair.
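As an illustration only, a clone genealogy can be sketched as a sequence of consistent and inconsistent states, with late propagation appearing as a divergence that is later re-synchronized. The names `State` and `has_late_propagation` below are hypothetical and are not taken from the thesis; the sketch does not distinguish the eight types of late propagation, which would also require knowing which clone each change modified.

```python
from enum import Enum


class State(Enum):
    """State of a clone pair after a change (hypothetical encoding)."""
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"


def has_late_propagation(states):
    """Return True if the genealogy diverges and is later re-synchronized.

    `states` is the chronological list of State values for one clone pair.
    Assumption: a consistent -> inconsistent transition marks a divergence,
    and a later inconsistent -> consistent transition marks the
    re-synchronizing change.
    """
    diverged = False
    for prev, curr in zip(states, states[1:]):
        if prev is State.CONSISTENT and curr is State.INCONSISTENT:
            diverged = True
        elif diverged and prev is State.INCONSISTENT and curr is State.CONSISTENT:
            return True
    return False
```

For example, a genealogy that goes consistent, inconsistent, consistent would be flagged, while one that diverges and never re-synchronizes would not.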