The most important security vulnerabilities thus far have been found via laborious code auditing.

Also, this is the only way vulnerabilities can be found and fixed during development. However, as software production rates increases, so does the need for a reliable, automated method for checking or classifying this code in order to prioritize and organize human efforts in manual checks.

The art of Vulnerability Detection mostly revolves around extraction of features like

  • proper API usage patterns i.e. pairing malloc with free or getConnection with closeConnection,
  • missing checks, like ensuring a number is non-zero before dividing by it,
  • lack of input validation, leading to injections, buffer overflows, …
  • lack of access controls, which may lead to confidential information being leaked, altered or denied access to.

Pattern Recognition Approach

The key is the technique used for extracting features, which range from conventional parsers, data-flow and control-flow analysis, and even directly text mining the source code.

Our own Dr. Yamaguchi et al. (2011, 2012) took lead in mimicking the mental process behind the daily grind of the code auditor: searching for similar instances of recently discovered vulnerabilities. They sensibly call this vulnerability extrapolation.

The gist:

  • parse the source code,
  • embed into vector space via a bag-of-words-like method,
  • perform semantic analysis to obtain particular matrices, and
  • then compare to known-vulnerable code using standard distance functions.
  • Since we expect to extract meaningful patterns from the code, we also need a "comprehensive and feature-rich representation" of it, which gives birth to Code Property Graph.

Code Property Graph - a new representation by combining three existing representations

The basic idea is to parse the source code, represent it as a graph by combining formally sound representations like

  • the standard Abstract Syntax Tree (AST) specific to the grammar defined by programming language
  • the Control Flow Graph (CFG), which represents the order in which statements are executed depending on the conditions, and
  • the Data Flow Graph , which tells us how variables are initialized, transformed and dispatched by statements.

These are better understood by example. Consider the following C snippet:

void foo() {       
     int x = source();       
     if( x < MAX )  {                  
          int y = 2*x;                 
          sink(y);       
      }
}

In this context,

  • source is meant to be the place in the code where the variables might be tainted by the attacker, and
  • sink is meant to be the function which might be vulnerable if the input from the source is not sanitized. The AST, CFG, and DFG for such a snippet would look like this:
Code graph comparison

Figure 2. Comparison of graph representations of code

Neither of them, alone, can tell us the whole story about what is happening in that snippet, which is, of course, necessary to identify vulnerabilities, since these are generally very context-dependent.

Yamaguchi and his team in 2012 put forward the idea of combining these three graphs into a single multigraph (basically a graph that allows multiple edges for a pair of nodes) which they call the code property graph:

A code property graph

Figure 3. A code property graph

Essentially, it’s like the AST with additional edges "borrowed" from the DFG and CFG, and are represented with different colors and labels.

By manually crafting such graph traversals, Yamaguchi was able to identify 88 vulnerabilities in the Linux kernel. Out of those, these 18 were previously unknown:

Zero-day vulnerabilities found

Figure 4. Zero-day vulnerabilities found using CPG

That’s 15 CVEs right there, made with a graph representation of code and four measly traversals.