Every piece of software contains flaws. Some are minor and merely impair an application’s functioning, while others can compromise its security. Finding and fixing those security flaws is essential to application security.
Code scanning, which now uses machine learning, detects potential security flaws in software so that vulnerabilities can be identified and corrected before the code is released into production, reducing the security risks they pose.
GitHub Code Scanning identifies potential security flaws in source code by applying established CodeQL analysis rules. GitHub intends to extend its rule-based code scanning to less prevalent vulnerability patterns by using machine learning to automatically infer new rules from the existing ones.
CodeQL queries encode knowledge of a wide range of potential sources of user data (for example, web frameworks), as well as potentially dangerous sinks (such as libraries for executing SQL queries). This makes it possible to detect scenarios in which unsafe user data ends up in a harmful place.
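To make the source-to-sink idea concrete, here is a minimal sketch in Python. It is not CodeQL and not GitHub's code: the table, the function names, and the input string are all hypothetical, chosen only to show user data reaching a SQL-execution sink with and without protection.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a1"), ("bob", "b2")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, user_input: str) -> list:
    # Source: user_input. Sink: conn.execute. The data flows into the query
    # text unmodified, which is the pattern a taint-style rule looks for.
    query = "SELECT secret FROM users WHERE name = '" + user_input + "'"
    return conn.execute(query).fetchall()


def lookup_safe(conn: sqlite3.Connection, user_input: str) -> list:
    # Same source and sink, but the value is bound as a parameter, so it is
    # never interpreted as SQL.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (user_input,)
    ).fetchall()
```

Calling `lookup_unsafe(make_db(), "x' OR '1'='1")` returns every row in the table, while `lookup_safe` with the same input returns none.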
Manually constructing rules for potential vulnerability patterns requires security specialists to analyze existing libraries and private code. This is a challenging task, given the vast number of existing libraries.
With machine learning algorithms becoming increasingly capable, the researchers sought to train ML models to spot vulnerable code using a massive dataset of such samples. To that end, they used both supervised and unsupervised learning to classify each code snippet as vulnerable or safe. Their findings show that the supervised model outperforms the one trained with unsupervised learning.
Rather than relying on experts to label millions of snippets for training, GitHub uses existing CodeQL rules as a ground-truth oracle that determines whether a code snippet is secure. This makes it easy to label tens of millions of code snippets from over a hundred thousand public repositories. The resulting data serves as the training set for a prediction model.
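The oracle-labeling step can be sketched as follows. This is only an illustration under loud assumptions: `toy_oracle` is a hypothetical stand-in that uses a crude string check, whereas real CodeQL rules perform proper data-flow analysis, and `label_snippets` is an invented helper name.

```python
from typing import Callable, Iterable, List, Tuple


def toy_oracle(snippet: str) -> bool:
    """Hypothetical stand-in for a rule set (NOT CodeQL).

    Flags snippets that appear to build a SQL string by concatenation
    or %-formatting before passing it to execute().
    """
    return "execute(" in snippet and (" + " in snippet or "%" in snippet)


def label_snippets(
    snippets: Iterable[str], oracle: Callable[[str], bool]
) -> List[Tuple[str, int]]:
    """Attach a 0/1 label to each snippet using the oracle, no human needed."""
    return [(snippet, 1 if oracle(snippet) else 0) for snippet in snippets]
```

The design point is that the labeling function takes the oracle as a parameter, so the same pipeline could be rerun with a different rule set.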
GitHub has devised a novel approach to test whether this model can genuinely predict new vulnerabilities, not merely those already captured by the CodeQL rules that were used as oracles. This entails training the model on labels produced by an older set of CodeQL rules and then testing it against vulnerabilities discovered by a newer set of CodeQL rules.
According to the team, this method can be used to confirm whether a model has learned to find vulnerabilities that were not included in the previous set of rules. However, this is based on the assumption that the newer rules increase the number of accurately detected vulnerabilities.
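One way to picture this evaluation is as a recall measurement restricted to the findings that only the newer rules report. The sketch below assumes snippets are identified by hashable IDs; the function name and the choice of recall as the metric are illustrative assumptions, not GitHub's published procedure.

```python
from typing import Hashable, Set


def recall_on_unseen(
    predicted: Set[Hashable],
    old_rule_hits: Set[Hashable],
    new_rule_hits: Set[Hashable],
) -> float:
    """Fraction of 'new-only' vulnerabilities that the model also flagged.

    predicted:     snippet IDs flagged by a model trained on old-rule labels
    old_rule_hits: snippet IDs flagged by the older rule set
    new_rule_hits: snippet IDs flagged by the newer rule set
    """
    # Findings the model could not have memorized from its training labels.
    unseen = new_rule_hits - old_rule_hits
    if not unseen:
        raise ValueError("the newer rules add no findings to evaluate against")
    return len(predicted & unseen) / len(unseen)
```

For example, with `predicted={1, 2, 5}`, `old_rule_hits={1, 2}` and `new_rule_hits={1, 2, 3, 5}`, the unseen set is `{3, 5}` and the result is `0.5`. The `ValueError` branch reflects the assumption stated above: the check is only meaningful if the newer rules actually find more.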
It is interesting to note how GitHub uses CodeQL to extract the features of a code sample. Instead of treating code as plain text, as NLP approaches do, GitHub can extract details like the access path, API name, and enclosing function body. Furthermore, GitHub can explore potentially valuable features that are less obvious to the human eye, such as the argument index in a function call.
The team built a vocabulary from the training data and feeds lists of vocabulary indices into a fairly simple deep learning classifier: a few layers of feature-by-feature processing, followed by feature concatenation and a few layers of combined processing.
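The shape of such a pipeline can be sketched in plain Python. Everything here is a simplified assumption for illustration: the class and function names are hypothetical, each stage is reduced to a single layer, and the weights are random and untrained, so the output is not a meaningful prediction.

```python
import math
import random
from typing import Dict, Iterable, List

UNKNOWN = 0  # index reserved for tokens not seen in the training data


def build_vocab(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Map each distinct training token to an index, starting at 1."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Turn tokens into vocabulary indices, using UNKNOWN for unseen ones."""
    return [vocab.get(token, UNKNOWN) for token in tokens]


class TinyClassifier:
    """Untrained toy model: per-feature embedding, concatenation, one output unit."""

    def __init__(self, vocab_sizes: Dict[str, int], dim: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.names = sorted(vocab_sizes)
        # Feature-by-feature stage: a separate embedding table per feature,
        # with one extra row for the UNKNOWN index.
        self.embed = {
            name: [
                [rng.uniform(-1, 1) for _ in range(dim)]
                for _ in range(vocab_sizes[name] + 1)
            ]
            for name in self.names
        }
        # Combined stage: one weight per element of the concatenated vector.
        self.weights = [rng.uniform(-1, 1) for _ in range(dim * len(self.names))]
        self.bias = 0.0

    def _feature_vector(self, name: str, indices: List[int]) -> List[float]:
        # Average the embeddings of all indices for this feature.
        rows = [self.embed[name][i] for i in indices] or [self.embed[name][UNKNOWN]]
        return [sum(column) / len(rows) for column in zip(*rows)]

    def predict(self, encoded: Dict[str, List[int]]) -> float:
        concatenated: List[float] = []
        for name in self.names:
            concatenated.extend(self._feature_vector(name, encoded.get(name, [])))
        z = sum(w * x for w, x in zip(self.weights, concatenated)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)
```

A real model would learn the embedding tables and weights from the labeled training set; the sketch only shows how index lists for several features flow through separate processing, concatenation, and a combined output.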
At prediction time, CodeQL translates a code snippet into a set of features, which are then passed to the ML model to estimate the likelihood that the snippet constitutes a vulnerability.