Software Security Flaw Prediction Using Rich Contextualized Language Use Vectors: A Case Study on the Linux Kernel
Abstract
Security vulnerabilities are a major threat to software systems: they can cause information loss, privilege escalation, data breaches, and system failure, among other problems. Software vulnerability prediction is therefore a critical part of software engineering. A variety of approaches have been proposed to detect the most likely locations of vulnerabilities in large codebases. Many existing methods rely on traditional software metrics such as lines of code, complexity, and code churn. In this study, we explored the use of Rich Contextualized Language Use Vectors (RCLUVs) as a feature set for predicting vulnerabilities in the Linux kernel.
The RCLUV of a source code file contains elements representing the frequency of each programming language feature being used, both individually and in the context of other features. This code profile is generated by parsing the source code of a program and analyzing the resulting parse tree.
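The idea of counting language features both individually and in context can be illustrated with a minimal sketch. This is not the study's extractor: it uses Python's built-in `ast` module on a Python snippet as a stand-in for a C parse tree, and it assumes that "context" means the kind of the immediate parent node. The function name `language_use_vector` and the `Parent>Child` key format are hypothetical.

```python
import ast
from collections import Counter


def language_use_vector(source: str) -> Counter:
    """Count each syntax-node kind alone and paired with its parent's kind.

    Assumption: "context" is approximated by the immediate parent node;
    the feature definitions used in the study may differ.
    """
    tree = ast.parse(source)
    counts = Counter()

    def visit(node, parent_kind):
        kind = type(node).__name__
        counts[kind] += 1  # feature used individually
        if parent_kind is not None:
            counts[f"{parent_kind}>{kind}"] += 1  # feature used in context
        for child in ast.iter_child_nodes(node):
            visit(child, kind)

    visit(tree, None)
    return counts


# Toy input: a loop containing a conditional return.
sample = "def f(xs):\n    for x in xs:\n        if x:\n            return x\n"
vector = language_use_vector(sample)
```

Here `vector["If"]` counts every conditional in the file, while `vector["For>If"]` counts only conditionals nested directly inside a loop, so the same construct contributes differently depending on where it appears.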
We mined vulnerabilities reported in the National Vulnerability Database (NVD) and built a dataset containing all known vulnerable files in the 14-year history of the Linux kernel. We built and evaluated RCLUV-based prediction models using different machine learning algorithms under both experimental and realistic scenarios. Analysis of the models' learning curves demonstrates that RCLUVs are effective for training machine learning models to learn vulnerability patterns. A performance comparison with four popular vulnerability prediction models shows that our approach outperforms the models trained on includes, function calls, and software metrics in an experimental setup. Moreover, given enough training data, our models successfully predict more than half of the future, unseen vulnerabilities in a real-life setting.
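One way to picture the difference between the experimental and the realistic scenario is a time-ordered split, sketched below. The abstract does not specify how the realistic scenario is constructed, so the cutoff-date split is an assumption, and the file names, dates, flagged list, and helper names (`temporal_split`, `recall`) are all hypothetical toy values rather than data from the study.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split (path, report_date) records at a cutoff date.

    Assumption: the realistic scenario trains only on vulnerabilities
    reported before the cutoff and is scored on those reported later.
    """
    past = [r for r in records if r[1] < cutoff]
    future = [r for r in records if r[1] >= cutoff]
    return past, future


def recall(flagged, vulnerable):
    """Fraction of truly vulnerable files that appear in the flagged set."""
    vulnerable = set(vulnerable)
    if not vulnerable:
        return 0.0
    return len(vulnerable & set(flagged)) / len(vulnerable)


# Hypothetical toy records, not real NVD entries.
records = [
    ("a.c", date(2010, 1, 5)),
    ("b.c", date(2012, 6, 1)),
    ("c.c", date(2015, 3, 9)),
    ("d.c", date(2016, 11, 2)),
]
past, future = temporal_split(records, date(2014, 1, 1))

# Stand-in for the output of a model trained only on `past`.
flagged = ["c.c", "e.c"]
future_recall = recall(flagged, [path for path, _ in future])
```

In this framing, "predicting more than half of the future vulnerabilities" corresponds to a recall above 0.5 on the files that became known as vulnerable only after the cutoff.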