REMEND: Neural Decompilation for Reverse Engineering Math Equations from Binary Executables

This repo contains the synthetic and real-world datasets, along with the trained models for the REMEND paper.

Abstract

Analysis of binary executables implementing mathematical equations can benefit from reverse engineering semantic information about the implementation. Traditional algorithmic reverse engineering tools either do not recover semantic information or rely on dynamic analysis and symbolic execution, which incur high reverse engineering time. Algorithmic tools also require significant re-engineering effort to target new platforms and languages. Recently, neural methods for decompilation have been developed to recover human-like source code, but they do not extract semantic information explicitly. We develop REMEND, a neural decompilation framework that reverse engineers math equations from binaries to explicitly recover program semantics such as data flow and order of operations. REMEND combines a transformer encoder-decoder model for neural decompilation with algorithmic processing for the enhanced symbolic reasoning that math equations require. REMEND is the first work to demonstrate that transformers for neural decompilation can go beyond source code and reason about program semantics in the form of math equations. We train on a synthetically generated dataset containing multiple implementations and compilations of math equations to produce a robust neural decompilation model and demonstrate retargetability. REMEND obtains an accuracy of 89.8% to 92.4% across three Instruction Set Architectures (ISAs), three optimization levels, and two programming languages with a single trained model, extending the capability of state-of-the-art neural decompilers. We achieve this accuracy with a small model of up to 12 million parameters and an average execution time of 0.132 seconds per function. On a real-world dataset collected from open-source programs, REMEND generalizes better than state-of-the-art neural decompilers despite being trained only on synthetic data, achieving 8% higher accuracy.

Using the synthetic dataset

The compiled equations and assembly are present in dataset.zip. The tokenized assembly is present in tokenized.zip. The tokenized files are also preprocessed using fairseq-preprocess.

Extract only tokenized.zip if you just want to use the synthetic data to train new models. Extract dataset.zip if you want to tokenize the assembly differently or modify the data before processing.
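
If you modify the extracted data, it has to be re-binarized with fairseq-preprocess before training. The command below is only a sketch: it assumes the split files follow fairseq's <split>.<lang> naming with asm as the source side and eqn as the target side, so adjust the prefixes, language suffixes, and destination directory to match the actual files.

fairseq-preprocess --source-lang asm --target-lang eqn \
    --trainpref train --validpref valid --testpref test \
    --destdir <new-tokenized-dataset> --workers 8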

Using the real-world dataset

Follow these steps to compile and evaluate the real-world dataset:

  1. Run make.sh to compile the source files into ELF and assembly files
  2. Run python3 collect_dataset.py to disassemble the ELF functions for REMEND processing
  3. Run generate.sh to run REMEND, generate equations, and evaluate them for correctness

The dataset will be present in dataset/<arch>.eqn and dataset/<arch>.asm. The results will be present in generated/base/<arch>_res_<beamsize>.txt.
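
For a quick sanity check after generate.sh finishes, the files can be inspected directly. The snippet below is a sketch that assumes a beam size of 1; the comments describe the expected contents, and <arch> should be replaced by the architecture names used in your run.

less dataset/<arch>.eqn                      # ground-truth equations
less dataset/<arch>.asm                      # disassembled functions fed to REMEND
less generated/base/<arch>_res_1.txt         # generated equations and correctness results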

The folder real_world_dataset/related_evals contains scripts to evaluate the related works BTC, SLaDE, and Nova. Each of the related works needs to be set up before evaluation. See each script for further instructions.

Replicate the synthetic dataset

The following steps recreate the dataset present in dataset.zip and tokenized.zip:

  1. Download prim_fwd.tar.gz, the FAIR dataset from the paper "Deep Learning for Symbolic Mathematics" (this is a large file)
  2. Extract the downloaded data: tar xvf prim_fwd.tar.gz
  3. Compile the equations and disassemble them via the REMEND disassembler: ./compile.sh
  4. Combine the compiled equations and assembly files, remove duplicates, and split: ./combine.sh
  5. Tokenize the split assembly files: ./tokenize.sh

Training command

The fairseq library is used to train the Transformer. Each model is trained with the following command, with parameters varied as needed.

fairseq-train <tokenized-dataset> --task translation --arch transformer \
    --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
    --max-source-positions 1024 --max-target-positions 1024 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
    --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
    --max-tokens 20000 --max-update 100000 \
    --no-epoch-checkpoints --keep-best-checkpoints 3 \
    --save-dir <save-dir> --log-file <save-dir>/training.log

The following command runs a trained model and generates translations:

fairseq-generate <tokenized-dataset> --task translation --arch transformer \
    --max-source-positions 1024 --max-target-positions 1024 \
    --path <checkpoint> --results-path <out-dir> --gen-subset <train/test/valid> --beam 1
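
The <checkpoint> argument is typically <save-dir>/checkpoint_best.pt from training. With --results-path set, fairseq-generate writes its output to <out-dir>/generate-<subset>.txt (naming may vary across fairseq versions), where hypothesis lines start with H-. A minimal sketch for pulling the generated sequences out in sample order:

grep '^H-' <out-dir>/generate-test.txt | sort -t- -k2 -n | cut -f3 > hypotheses.txt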

Ablations

The ablations folder contains the data for the three ablations presented in the paper: REMEND without constant identification, REMEND with equations in postfix notation, and REMEND trained with the REMaQE data included.
