
OpTrans: Enhancing Binary Code Similarity Detection with Function Inlining Re-Optimization

About

OpTrans (Re-Optimization Transformer) is a framework that fuses binary code optimization techniques with the transformer model for binary code similarity detection (BCSD). OpTrans employs an algorithm based on binary program analysis to determine which functions should be inlined, then applies binary rewriting techniques to re-optimize the binaries. Our goal is to provide an effective tool for researchers and practitioners in binary code similarity detection, with our models accessible on the Hugging Face Model Hub.
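The selection algorithm itself is defined in the paper and in this repository's scripts; purely as an illustrative sketch of the kind of call-graph heuristic involved (the graph format and the size_limit/max_callers thresholds below are hypothetical, not OpTrans's actual criteria):

# Illustrative sketch only: a toy call-graph heuristic for picking inline candidates.
# Thresholds and data layout are hypothetical, not OpTrans's actual algorithm.
def select_inline_candidates(call_graph, func_sizes, size_limit=64, max_callers=8):
    """call_graph: dict mapping caller name -> set of callee names.
    func_sizes: dict mapping function name -> instruction count."""
    # Invert the call graph to count how many callers each function has
    callers_of = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers_of.setdefault(callee, set()).add(caller)

    candidates = set()
    for func, size in func_sizes.items():
        # Inline small functions with few call sites, so the rewritten
        # binary does not grow excessively.
        if size <= size_limit and len(callers_of.get(func, ())) <= max_callers:
            candidates.add(func)
    return candidates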

News

Intuition

This document presents how function inlining optimization improves binary code similarity detection.

Function Faust_next in sc3-plugins-HOAEncLebedev501.so compiled with -O0 (sc3-plugins-HOAEncLebedev501.so-O0.i64): [O0 screenshot]

Function Faust_next in sc3-plugins-HOAEncLebedev501.so compiled with -O3 (sc3-plugins-HOAEncLebedev501.so-O3.i64): [O3 screenshot]

Function Faust_next in sc3-plugins-HOAEncLebedev501.so compiled with -O0 and processed by function inlining optimization (sc3-plugins-HOAEncLebedev501.so-O0-inline.i64): [O0-inline screenshot]

The IDB files in ./Intuition are generated by IDA 8.3.

QuickStart

This document will help you set up and start using the OpTrans model for embedding generation.

Requirements

Ensure you have Python and PyTorch installed on your system. Then, install the Transformers library using pip:

pip install transformers

Preparing Tokenizers and Models

Import necessary libraries and initialize the model and tokenizers:

import torch
from transformers import AutoModel, AutoTokenizer

# Run on GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# trust_remote_code is required because OpTrans ships custom model/tokenizer code
tokenizer = AutoTokenizer.from_pretrained("sandspeare/optrans", trust_remote_code=True)
encoder = AutoModel.from_pretrained("sandspeare/optrans", trust_remote_code=True).to(device)
tokenizer.pad_token = tokenizer.unk_token  # use the unk token for padding, since no pad token is set

Example Use Cases

Function inlining optimization for BCSD

  1. Load your binary code dataset. For demonstration, we use a JSON file containing binary code snippets for similarity comparison.

import json

with open("./CaseStudy/casestudy.json") as fp:
    data = json.load(fp)
  2. Encode the binary code:
# Tokenize and embed each compiled variant of the function
asm_O0 = tokenizer([data["O0"]], padding=True, return_tensors="pt").to(device)
asm_embedding_O0 = encoder(**asm_O0)

asm_O0_inline = tokenizer([data["O0_inline"]], padding=True, return_tensors="pt").to(device)
asm_embedding_O0_inline = encoder(**asm_O0_inline)

asm_O3 = tokenizer([data["O3"]], padding=True, return_tensors="pt").to(device)
asm_embedding_O3 = encoder(**asm_O3)
  3. Perform similarity comparison:

# Dot-product similarity between embeddings, scaled by a temperature of 0.07
sim_O0vsO3 = torch.mm(asm_embedding_O0, asm_embedding_O3.T).squeeze() / 0.07
sim_O0_inlinevsO3 = torch.mm(asm_embedding_O0_inline, asm_embedding_O3.T).squeeze() / 0.07
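If inlining re-optimization works as intended, the O0-inline vs. O3 score should come out higher than the plain O0 vs. O3 score. A quick way to inspect the two scalars:

print(f"O0 vs O3:        {sim_O0vsO3.item():.4f}")
print(f"O0-inline vs O3: {sim_O0_inlinevsO3.item():.4f}")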

Details

In this document, we provide an overview of the contents of this repository and instructions for accessing the materials.

  1. CaseStudy.ipynb: A Jupyter Notebook showcasing the zero-shot performance of our proposed model using a case study. Please open this file to get an in-depth view of how our model works and the results it produces.

  2. CaseStudy: A folder containing binary code for the case study used in the Jupyter Notebook.

  3. Intuition: A folder containing screenshots of how inlining optimization eliminates false positives in binaries. The corresponding functions are also provided in the IDB files.

Processing Data

We provide an example script to process binary code, located at scripts/process.py. You can use it to process your own binaries.

/path/to/idat64 -c -A -S"scripts/process.py /path/to/output /path/to/inlinefuncs"  /path/to/binary

Here is a demo:

/path/to/idat64 -c -A -S"scripts/process.py Intuition/sc3-plugins-HOAEncLebedev501.json Intuition/O0_inline.pkl" Intuition/sc3-plugins-HOAEncLebedev501.so-O0.i64
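scripts/process.py is the authoritative implementation; as a rough, hypothetical sketch of the headless IDAPython pattern such a script follows (arguments arriving via idc.ARGV, per-function disassembly dumped to JSON), it might look like:

# Hypothetical sketch of a headless IDAPython batch script; scripts/process.py is the real one.
# idc.ARGV[1]/[2] correspond to the /path/to/output and /path/to/inlinefuncs arguments above.
import json
import idc
import idaapi
import idautils

idaapi.auto_wait()            # wait for IDA's auto-analysis to finish
output_path = idc.ARGV[1]
inlinefuncs_path = idc.ARGV[2]

funcs = {}
for ea in idautils.Functions():
    name = idc.get_func_name(ea)
    # collect the disassembly of every instruction in the function
    funcs[name] = [idc.GetDisasm(insn) for insn in idautils.FuncItems(ea)]

with open(output_path, "w") as fp:
    json.dump(funcs, fp)

idc.qexit(0)                  # terminate the headless IDA instance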

Fine-tune

We provide a script to fine-tune the base model with your own datasets.

python Intuition/fine_tune.py

Fast evaluation

We provide scripts to quickly evaluate the model's performance on binary code similarity detection.

python Intuition/create_embeddings.py --output_path embedding_dataset
python Intuition/fast_eval.py --data_path embedding_dataset
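Conceptually, fast evaluation treats BCSD as a retrieval task over the precomputed embeddings. A minimal sketch of recall@1 between two embedding pools (the tensor layout and function below are assumptions, not the scripts' actual interface):

import torch

# query_emb, pool_emb: (N, D) tensors of embeddings for the same N functions
# compiled at two optimization levels (assumed layout, row i matches row i).
def recall_at_1(query_emb, pool_emb):
    sims = query_emb @ pool_emb.T              # pairwise similarity matrix
    preds = sims.argmax(dim=1)                 # nearest pool function per query
    labels = torch.arange(query_emb.size(0))   # ground truth: matching index
    return (preds == labels).float().mean().item()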
