Facebook AI has introduced N-Bref, a neural-based decompiler framework that improves on the accuracy of traditional decompilation systems. The research, led by Jishen Zhao, is a collaboration between FAIR and the UCSD STABLE Lab. The study presents a comprehensive analysis of how each component of a neural-based decompiler design influences the overall accuracy of program recovery across different data set configurations.
What are Decompilers?
Decompilers convert low-level executable code (such as assembly instructions) back into a high-level programming language (such as C++), making it easier for people to read. They are very useful for detecting anomalies and vulnerabilities in computer security and forensics. They can also be used to detect likely viruses, debug programs, translate obsolete code, and recover lost source code.
Traditionally, a decompiler is designed by hand using heuristics from human experts. A domain expert writes many rules for every pair of programming languages (e.g., C++ and assembly), a time-consuming process that can take years and that needs extra attention and manual tuning in complicated situations. Additionally, any upgrade of the source language leads to a significant amount of maintenance work.
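To make the rule-based approach concrete, here is a minimal toy sketch in Python. It is not taken from any real decompiler: the rule table, the function names, and the Intel-style "destination, source" operand order are all assumptions made for illustration. Each rule pairs an assembly pattern with a C-like template, and any instruction without a rule is left unhandled, which is why such systems need so many hand-written rules.

```python
import re

# Hypothetical hand-written rules: each pairs an assembly pattern
# (Intel-style "op dest, src" is assumed) with a C-like template.
RULES = [
    (re.compile(r"mov\s+(\w+),\s*(\w+)"), r"\1 = \2;"),
    (re.compile(r"add\s+(\w+),\s*(\w+)"), r"\1 += \2;"),
    (re.compile(r"sub\s+(\w+),\s*(\w+)"), r"\1 -= \2;"),
]


def decompile_line(line):
    """Translate one instruction using the first rule that matches it."""
    line = line.strip()
    for pattern, template in RULES:
        match = pattern.fullmatch(line)
        if match:
            return match.expand(template)
    # No rule covers this instruction; an expert would have to add one.
    return "/* unhandled: " + line + " */"


def decompile(asm):
    """Translate a multi-line assembly listing, skipping blank lines."""
    return [decompile_line(l) for l in asm.splitlines() if l.strip()]


if __name__ == "__main__":
    for statement in decompile("mov eax, 5\nadd eax, ebx\nimul eax, 2"):
        print(statement)
```

In this sketch the `imul` instruction falls through to the "unhandled" branch because no rule covers it. Every new instruction, addressing mode, or language feature needs another rule, which illustrates the maintenance burden described above.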
A neural-based decompiler: N-Bref
N-Bref automates the design flow from data set generation to neural network training and evaluation, without requiring a human engineer. It is the first framework that repurposes state-of-the-art neural networks (such as the Transformers used in neural machine translation) to handle the deeply structured input and output data in practical code decompilation tasks. N-Bref works on assembly code compiled from generated C++ programs that routinely call standard libraries, as well as from simple solutions that resemble real codebases.
The team first encodes the input assembly code into a graph structure that represents the relationships between instructions. They then encode this graph with existing graph embedding tools to obtain a representation of the assembly code. The high-level semantic code is represented as an abstract syntax tree (AST). Memory-augmented transformers, which can handle highly structured assembly code, build and iteratively refine the AST. Lastly, the AST is converted into a high-level semantic language. The team also provides a tool for collecting training data, which generates and unifies the representation of high-level programming languages for neural decompiler research.
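The two ends of this pipeline can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not N-Bref's actual implementation: the node and edge labels, the `build_instruction_graph` and `render` functions, and the `Node` class are all hypothetical, and the graph embedding and transformer stages in the middle are omitted entirely. The first function turns a list of instructions into a graph, and the second turns a small AST back into C-like source text.

```python
from dataclasses import dataclass, field


def build_instruction_graph(instructions):
    """Build (nodes, edges) from a list of (opcode, operands) tuples.

    Instructions and operands both become nodes. A "next" edge links
    consecutive instructions and a "uses" edge links an instruction to
    each of its operands. These edge types are assumptions for this sketch.
    """
    nodes = []
    edges = []
    operand_ids = {}
    prev = None
    for opcode, operands in instructions:
        inst_id = len(nodes)
        nodes.append(("inst", opcode))
        if prev is not None:
            edges.append((prev, inst_id, "next"))
        for operand in operands:
            # Reuse one node per distinct operand so shared registers
            # connect the instructions that touch them.
            if operand not in operand_ids:
                operand_ids[operand] = len(nodes)
                nodes.append(("operand", operand))
            edges.append((inst_id, operand_ids[operand], "uses"))
        prev = inst_id
    return nodes, edges


@dataclass
class Node:
    """A hypothetical AST node: a kind, an optional value, and children."""
    kind: str
    value: str = ""
    children: list = field(default_factory=list)


def render(node):
    """Convert a small AST into C-like source text."""
    if node.kind in ("var", "const"):
        return node.value
    if node.kind == "binop":
        left, right = node.children
        return f"({render(left)} {node.value} {render(right)})"
    if node.kind == "assign":
        target, expr = node.children
        return f"{render(target)} = {render(expr)};"
    raise ValueError(f"unknown node kind: {node.kind}")


if __name__ == "__main__":
    nodes, edges = build_instruction_graph(
        [("mov", ["eax", "5"]), ("add", ["eax", "ebx"])]
    )
    print(nodes)
    print(edges)
    tree = Node("assign", children=[
        Node("var", "x"),
        Node("binop", "+", [Node("var", "a"), Node("const", "1")]),
    ])
    print(render(tree))
```

In the system described above, a learned model sits between these two steps and predicts the AST from the embedded graph. Here the tree is written by hand only to show the final conversion.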
The team states that N-Bref outperforms traditional decompilers, especially when the input program is long and has sophisticated control flow. The system can decompile real-world C code from the standard C library and basic code bases written by humans to solve real-world problems.
This is the first time an end-to-end trainable code decompiler system has performed well on a widely used programming language such as C++. The advance is a step towards a practical decompiler system that operates on large-scale codebases. The team has also developed the first data set generation tool for neural-based decompiler development and testing. The tool generates code similar to that written by human programmers and is suitable for developing learning-based methods.