This repository contains the implementation of our paper *Efficient Cross-Architecture Binary Function Embeddings through Knowledge Distillation*.
- [2025/04/10] We publish the code and models used in our research
- Python 3.11+
- transformers
- sentence_transformers
- CUDA-capable GPU (for optimal performance)
- Ghidra 11.x
The pre-trained models (trained on the EKLAVYA and MISA datasets) are available on Google Drive (696 MB).
We used the EKLAVYA and MISA datasets for our pre-trained models. To train a new model on a custom dataset, the dataset should be structured as follows:
For the pre-training task, use masked language modeling (MLM) on raw assembly listings. Prepare a Hugging Face dataset and build the tokenizer with `01_tokenizer.py`. Then start the training at step `02_mlm.py`. Configure your custom paths in the `config` dictionary.
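A minimal sketch of how such a dataset could be prepared with the `datasets` library; the column name `text` and all paths are illustrative assumptions, not requirements of the scripts:

```python
# Minimal sketch: build a Hugging Face dataset of raw assembly listings.
# The column name "text" and the on-disk layout are assumptions for illustration.
from datasets import Dataset

# One entry per function: the raw assembly listing as a single string.
assembly_listings = [
    "push rbp\nmov rbp, rsp\nsub rsp, 0x20\n...",
    "push rbp\nmov rbp, rsp\nxor eax, eax\n...",
]

dataset = Dataset.from_dict({"text": assembly_listings})
dataset.save_to_disk("data/asm_pretrain")  # hypothetical path
print(dataset)
```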
For the BCSD downstream task, we used a Triplet Loss during training (see Section 3.5 of the paper). A custom dataset needs to contain the following columns:
- sample1
- sample2
- label (e.g. 1.0 for similar samples, 0.0 for dissimilar samples)
Start the training at step `03_sbert.py`. Configure your custom paths in the `config` dictionary.
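A minimal sketch of a compatible pair dataset with the columns listed above, built with the `datasets` library; the assembly strings and paths are placeholders:

```python
# Minimal sketch: a BCSD fine-tuning dataset with the columns sample1, sample2, label.
from datasets import Dataset

pairs = {
    "sample1": ["push rbp\nmov rbp, rsp\n...", "xor eax, eax\nret"],
    "sample2": ["endbr64\npush rbp\nmov rbp, rsp\n...", "mov rax, 0x1\nret"],
    "label":   [1.0, 0.0],  # 1.0 = similar functions, 0.0 = dissimilar
}

dataset = Dataset.from_dict(pairs)
dataset.save_to_disk("data/bcsd_pairs")  # hypothetical path
```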
To add support for a custom processor architecture, prepare a dataset that aligns pairs of (anchor, target) assembly code listings (i.e. a `ParallelSentencesDataset`). If you want to use the provided anchor model `bert-x86-eklavya`, the anchor samples should be x86_64 assembly.
Start the training at step `04_mse.py`. Configure your custom paths in the `config` dictionary.
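A minimal sketch of writing such (anchor, target) pairs to a tab-separated file that a `ParallelSentencesDataset` can load; how instructions are delimited within a listing is an assumption for illustration:

```python
# Minimal sketch: write (anchor, target) pairs as a tab-separated file.
# Each line: x86_64 anchor listing <TAB> custom-architecture target listing.
# Listings are flattened to a single line here; the exact delimiter between
# instructions is an illustrative assumption.
anchor_target_pairs = [
    ("push rbp ; mov rbp, rsp ; ret", "stp x29, x30, [sp, #-16]! ; mov x29, sp ; ret"),
    ("xor eax, eax ; ret",            "mov w0, #0 ; ret"),
]

with open("data/parallel_x86_arm.tsv", "w", encoding="utf-8") as f:  # hypothetical path
    for anchor, target in anchor_target_pairs:
        f.write(f"{anchor}\t{target}\n")
```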
This guide provides a quick overview of how to use the scripts in the AutoBERT repository.
Build a custom tokenizer from a corpus of assembly code.
python 01_tokenizer.py -c <path_to_corpus_file(s)> -o <output_path> -e <encoding>
- -c: Path(s) to corpus file(s) (multiple allowed).
- -o: Directory to save the tokenizer.
- -e: Encoding of the corpus files (default: utf-8).
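Once the script has finished, the tokenizer can be loaded like any Hugging Face tokenizer, assuming it is saved in the standard format; the output path below is a placeholder:

```python
# Minimal sketch: load and test the tokenizer produced in step 1.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tokenizer/")  # hypothetical output path from -o
print(tokenizer.tokenize("push rbp"))
```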
Train a BERT model with masked language modeling on assembly code.
python 02_mlm.py
Configure paths and parameters in the config section of the script:
- tokenizer: Path to the tokenizer created in step 1.
- model_save_path: Directory to save the trained model.
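For orientation, a generic sketch of MLM pre-training with `transformers`; it is not necessarily how `02_mlm.py` is implemented, and all paths and hyperparameters are illustrative assumptions:

```python
# Generic sketch of MLM pre-training on assembly listings; not necessarily
# how 02_mlm.py works internally. Paths and hyperparameters are assumptions.
from datasets import load_from_disk
from transformers import (
    AutoTokenizer, BertConfig, BertForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("tokenizer/")   # tokenizer from step 1
dataset = load_from_disk("data/asm_pretrain")             # hypothetical dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="models/bert-asm-mlm", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```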
Train a Sentence-BERT model for generating embeddings of assembly functions.
python 03_sbert.py
Configure paths and parameters in the config section:
- model_load_path: Path to the pre-trained BERT model.
- model_save_path: Directory to save the trained SBERT model.
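For orientation, a generic sketch of wrapping the pre-trained BERT from step 2 in a Sentence-BERT model with mean pooling; not necessarily how `03_sbert.py` is implemented, and the paths are placeholders:

```python
# Generic sketch: build a Sentence-BERT model on top of the MLM BERT from step 2.
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer("models/bert-asm-mlm", max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode="mean")
sbert = SentenceTransformer(modules=[word_embedding, pooling])

# Embed two assembly functions and inspect the embedding shape.
embeddings = sbert.encode(["push rbp ; mov rbp, rsp ; ret", "xor eax, eax ; ret"])
print(embeddings.shape)
```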
Distill knowledge from a teacher model to a student model using MSE loss.
python 04_mse.py
Configure paths and parameters in the config section:
- teacher_model: Path to the teacher model.
- student_model: Directory to save the student model.
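For orientation, a generic sketch of MSE-based distillation following the standard `sentence_transformers` recipe (`ParallelSentencesDataset` plus `losses.MSELoss`); it may differ from `04_mse.py`, the availability of `ParallelSentencesDataset` depends on your `sentence_transformers` version, and all paths and hyperparameters are assumptions:

```python
# Generic sketch: distill teacher embeddings into a student model with MSE loss.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer("models/sbert-asm")          # hypothetical teacher (e.g. step 3 output)
student = SentenceTransformer("models/sbert-asm-student")  # hypothetical student to be trained

# Each TSV line: anchor (teacher-architecture) listing <TAB> target-architecture listing.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher,
                                      batch_size=32, use_embedding_cache=True)
train_data.load_data("data/parallel_x86_arm.tsv")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=1, warmup_steps=100,
            output_path="models/sbert-asm-distilled")
```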
Evaluate models using triplet accuracy and binary classification metrics.
python 05_evaluation.py
Configure paths and parameters in the config section:
- models: List of model paths to evaluate.
- eklavya and misa: Paths to the EKLAVYA and MISA evaluation datasets.
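For orientation, a generic sketch of the two evaluation modes using the built-in `sentence_transformers` evaluators; not necessarily how `05_evaluation.py` is implemented, and the model path and samples are placeholders:

```python
# Generic sketch: triplet accuracy and binary classification evaluation.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator, BinaryClassificationEvaluator

model = SentenceTransformer("models/sbert-asm-distilled")  # hypothetical model path

# Triplet accuracy: the anchor should be closer to the positive than to the negative.
triplet_eval = TripletEvaluator(
    anchors=["push rbp ; ret"],
    positives=["endbr64 ; push rbp ; ret"],
    negatives=["xor eax, eax ; ret"],
)
print(triplet_eval(model))

# Binary classification: are two functions similar (1) or dissimilar (0)?
pair_eval = BinaryClassificationEvaluator(
    sentences1=["push rbp ; ret"],
    sentences2=["xor eax, eax ; ret"],
    labels=[0],
)
print(pair_eval(model))
```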