This repository is the implementation of PDM, a graph-based binary function similarity analysis method proposed in the paper "Position Distribution Matters: A Graph-based Binary Function Similarity Analysis Method".
This work builds on code provided by PalmTree and CapsGNN.
You can compile the training and evaluation datasets yourself, or obtain them from BinKit.
- numpy
- r2pipe
- re
- scipy
- torch
- torchvision
- glob, json, multiprocessing, os, shutil, tqdm, etc.
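Before running anything, it can help to confirm that the third-party packages in the list above are importable. The following is a minimal sketch, not part of the repository: the `REQUIRED` list and the `missing_packages` helper are names chosen here for illustration, and the standard-library modules from the list (`re`, `glob`, `json`, and so on) are left out because they ship with Python.

```python
import importlib.util

# Third-party packages named in the dependency list above.
# Standard-library modules are omitted on purpose.
REQUIRED = ["numpy", "r2pipe", "scipy", "torch", "torchvision", "tqdm"]


def missing_packages(names):
    """Return the top-level package names the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Note that `r2pipe` only wraps radare2, so the `radare2` binary itself still has to be installed separately.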
Preparation:
~ $ git clone https://github.com/TyeYeah/PositionDistributionMatters.git
~ $ sudo apt install radare2 # or visit `https://github.com/radareorg/radare2/releases` for latest version (recommended).
~ $ conda install numpy ...
~ $ pip install r2pipe ...
~ $ cd PositionDistributionMatters
~/PositionDistributionMatters $
Train the BIRD model and construct the ACFG+ of each function:
~ $ cd PositionDistributionMatters
~/PositionDistributionMatters $ cd BIRD
# prepare binaries in bin_bird/ and bin_pdm/
~/PositionDistributionMatters/BIRD $ python r2exp.py
# see the main function in r2exp.py to run only `bird` model training or only instruction embedding
# training output is written to the `data` dir in `BIRD`
# an `output` dir is generated in `PDM`
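To give a rough idea of what driving radare2 from Python looks like, here is a small sketch that lists the functions of one binary together with their basic-block counts. It is not the code in `r2exp.py`, and it does not build an ACFG+. The helper names `list_functions` and `block_counts` are made up for this example, and it assumes the `nbbs` field that radare2's `aflj` command reports for each function.

```python
def list_functions(binary_path):
    """Analyze a binary with radare2 and return its function records."""
    import r2pipe  # imported lazily so block_counts() works without radare2

    r2 = r2pipe.open(binary_path, flags=["-2"])  # -2 silences radare2's stderr
    try:
        r2.cmd("aaa")                 # run radare2's automatic analysis
        return r2.cmdj("aflj") or []  # function list, parsed from JSON
    finally:
        r2.quit()


def block_counts(functions):
    """Map each function name to its basic-block count (0 if not reported)."""
    return {f["name"]: f.get("nbbs", 0) for f in functions if "name" in f}
```

Calling `block_counts(list_functions(path))` on one of the binaries placed in `bin_pdm/` should give a quick sanity check that radare2 can analyze it before the full pipeline is run.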
Train and use the function ACFG+ graph embedding model:
~/PositionDistributionMatters/BIRD $ cd ../PDM
~/PositionDistributionMatters/PDM $ python main.py --expmode train_s/train_t/evaluate_s/evaluate_t/embed_s/embed_t
The expmode value can be one of:
- train_s: train using the siamese loss
- train_t: train using the triplet loss
- evaluate_s: evaluate the model generated by train_s
- evaluate_t: evaluate the model generated by train_t
- embed_s: generate graph embeddings using the model generated by train_s
- embed_t: generate graph embeddings using the model generated by train_t
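The mode names above follow an `<action>_<loss>` pattern, where the suffix `s` stands for siamese and `t` for triplet. The sketch below shows one way such a flag could be parsed and split with `argparse`. It is an illustration only and is not taken from the repository's `main.py`; `parse_mode` and `describe` are hypothetical helpers.

```python
import argparse

# The six values documented for --expmode.
MODES = ["train_s", "train_t", "evaluate_s", "evaluate_t", "embed_s", "embed_t"]


def parse_mode(argv=None):
    """Parse --expmode and reject anything outside the documented values."""
    parser = argparse.ArgumentParser(description="expmode parsing sketch")
    parser.add_argument("--expmode", choices=MODES, required=True,
                        help="one of: " + ", ".join(MODES))
    return parser.parse_args(argv).expmode


def describe(mode):
    """Split a mode such as 'train_t' into (action, loss name)."""
    action, suffix = mode.rsplit("_", 1)
    return action, {"s": "siamese", "t": "triplet"}[suffix]


if __name__ == "__main__":
    action, loss = describe(parse_mode())
    print(f"action={action}, loss={loss}")
```

Since each evaluate and embed mode relies on the model produced by the matching train mode, a typical sequence would be `train_t` first, followed by `evaluate_t` or `embed_t`.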
The corresponding embed_s and embed_t functions in main.py need to be customized by users.
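Once a customized embedding function produces one vector per function, a common next step is to compare those vectors. The sketch below ranks candidate functions against a query by cosine similarity. It assumes embeddings are available as plain lists of floats and that cosine similarity is an acceptable metric; neither assumption comes from the repository, and `cosine_similarity` and `rank_candidates` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate names sorted from most to least similar to the query.

    `candidates` maps a function name to its embedding vector.
    """
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, key=lambda t: t[0], reverse=True)]
```

For example, `rank_candidates(query_vec, {"func_a": vec_a, "func_b": vec_b})` returns the names ordered by how closely their embeddings match `query_vec`.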