Classifying And Predicting Software Security Vulnerabilities based on Reproducibility
Vulnerability , Security Failure , Hard to Reproduce Vulnerability , Vulnerability Classification , Software Metrics , Vulnerability Prediction , Machine Learning
Security defects are common in large software systems because of their size and complexity. Although efficient development processes, testing, and maintenance policies are applied to software systems, there are still a large number of vulnerabilities that can remain, despite these measures. Some vulnerabilities stay in a system from one release to the next one because they cannot be easily reproduced through testing. These vulnerabilities endanger the security of the systems. We propose vulnerability classification and prediction frameworks based on vulnerability reproducibility. The frameworks are effective to identify the types and locations of vulnerabilities in the earlier stage, and improve the security of software in the next versions (referred to as releases). We expand an existing concept of software bug classification to vulnerability classification (easily reproducible and hard to reproduce) to develop a classification framework for differentiating between these vulnerabilities based on code fixes and textual reports. We then investigate the potential correlations between the vulnerability categories and the classical software metrics and some other runtime environmental factors of reproducibility to develop a vulnerability prediction framework. The classification and prediction frameworks help developers adopt corresponding mitigation or elimination actions and develop appropriate test cases. Also, the vulnerability prediction framework is of great help for security experts focus their effort on the top-ranked vulnerability-prone files. As a result, the frameworks decrease the number of attacks that exploit security vulnerabilities in the next versions of the software. To build the classification and prediction frameworks, different machine learning techniques (C4.5 Decision Tree, Random Forest, Logistic Regression, and Naive Bayes) are employed. The effectiveness of the proposed frameworks is assessed based on collected software security defects of Mozilla Firefox.