MalwareGPT

Introduction

This repo contains the experimental code for the paper: “Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG”

Directory Structure

src/
│  requirements.txt
│
├─experiment/
│  ├─CollectingDataset/
│  │      download_and_extract.py       *Extract malware raw files*
│  │
│  └─MainExperiment/                    *Scripts for the main experiments; different experimental configurations mainly differ in dataset settings*
│          step10_addtional_transfer_json_file.py  *Generate fine-tuning training set data*
│          step10_fine_tuning.py         *Conduct fine-tuning*
│          step11_sample_test.py         *Test samples for fine-tuning*
│          step12_GnenerateQueryCsv.py   *Generate the analysis report for the rule-based model*
│          step1_addtional_get_black_list.py  *Generate blacklists*
│          step1_build_db.py             *Generate the original version of the string database*
│          step2_get_min_len.py          *Obtain the minimum meaningful string length based on ChatGPT*
│          step3_del_min_len.py          *Modify the string database based on the minimum string length*
│          step4_filter_character.py     *Remove recursive strings*
│          step5_filter.py               *Remove database records based on frequency ranking to keep the number of records within 10,000*
│          step6_save_embedding_to_json.py  *Generate training set vectors*
│          step7_get_vector_from_sample.py  *Generate test set vectors*
│          step8_generate_db.py          *Generate vector database*
│          step9_query.py                *Generate query results*
│
├─test/                                 *Scripts for the RQ experiments*
│      experiment1_filter_dataset.py
│      experiment1_k_means.py
│
├─tool/                                 *External tools*
│  └─Floss/                             *For static extraction of strings from malware*
│          floss
│
└─utils/                                *Configuration folder*
    │  configs.py
    │  db_functions.py
    │  gpt_embedding.py
    │  init_configs.py
    │
    └─chatgpt/                          *LLM functionality support*
        │  chatgpt.py
        │  configs.py
        │  prompts.py
        │  util.py
        │  __init__.py

Data Collection

  1. Run the download_and_extract.py script to collect malware from the MalwareBazaar website. The malware files come as ZIP archives, and the password for every archive is "infected". The collected samples are processed with FLOSS, and the extracted strings are saved to the data directory.

  2. The original malware ZIP files are stored in the data/dataset/rawfile directory, and the FLOSS extraction results are saved in the data/dataset/floss_result directory. The malware is categorized by the year it was recorded, distinguishing malware recorded in 2024 from malware recorded before 2023.

    python src/experiment/CollectingDataset/download_and_extract.py

In the data/dataset/floss_result directory there is a collection of hash files for our test files; these correspond to the Training Set (Vector DB), Testing Set (2024 samples), and Fine-tuning Training Set described in the paper.
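For orientation, the sketch below shows what the extract-and-run-FLOSS step boils down to. The directory layout follows the description above; the use of pyzipper for the AES-encrypted archives and the -j JSON flag of the floss CLI are assumptions about the toolchain, since the actual logic lives in download_and_extract.py.

    # Sketch of the extract-and-run-FLOSS step; the real logic lives in
    # download_and_extract.py. pyzipper (for AES-encrypted archives) and the
    # floss -j JSON flag are assumptions about the toolchain.
    import subprocess
    from pathlib import Path

    import pyzipper  # Python's zipfile cannot open AES-encrypted ZIPs

    RAW_DIR = Path("data/dataset/rawfile")        # downloaded ZIP archives
    OUT_DIR = Path("data/dataset/floss_result")   # FLOSS string dumps
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    for zip_path in sorted(RAW_DIR.glob("*.zip")):
        with pyzipper.AESZipFile(zip_path) as zf:
            zf.extractall(RAW_DIR, pwd=b"infected")  # standard MalwareBazaar password
        sample = RAW_DIR / zip_path.stem             # sample assumed to be named by its hash
        out_file = OUT_DIR / f"{zip_path.stem}.json"
        with out_file.open("w") as fh:               # -j asks FLOSS for JSON output
            subprocess.run(["floss", "-j", str(sample)], stdout=fh, check=True)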

Main Experiment

  1. Generate the vector database and conduct the main experiments mentioned in the paper by running the scripts in the MainExperiment folder step by step.
  2. First, call step1_build_db.py to generate the initial version of the string database, and then call step1_addtional_get_black_list.py to generate the blacklist JSON.
  3. Next, use step2_get_min_len.py to obtain string-length statistics. After plotting the CDF (see the sketch after this list), we determine the minimum meaningful length and call step3_del_min_len.py to filter the string database accordingly. The minimum length is set by the min_len variable in utils/configs.py.
  4. Then use step4_filter_character.py to remove the recursive strings that appear in the FLOSS extraction. If the string database is still unbalanced after the previous filtering, call step5_filter.py to cap each table at 10,000 records based on string frequency.
  5. Call step6_save_embedding_to_json.py and step7_get_vector_from_sample.py to generate cache files for the embedding vectors of the training set and the testing set, respectively. Caching ensures that embedding tokens are not consumed repeatedly (a sketch follows the list).
  6. Call step8_generate_db.py to generate the vector database. Note that the Milvus database we use only runs in a Linux environment.
  7. Call step9_query.py to query the Milvus database with the testing-set embeddings generated in step 5. The default query returns the top 10 results (see the Milvus sketch after this list).
  8. If no fine-tuning experiment is required, skip this step. Otherwise, first call step7_get_vector_from_sample.py repeatedly, changing TestingSetFilePath each time, to generate the training-set and testing-set embeddings for fine-tuning. Call step10_addtional_transfer_json_file.py to convert the fine-tuning training-set embeddings into the training file for the OpenAI fine-tuning model, then use step10_fine_tuning.py to fine-tune the model. When training succeeds, the email address associated with your OpenAI account receives a notification containing the fine-tuned model ID. In step11_sample_test.py, set the FineTuningModelId variable to that ID and point TestingSetFilePath at your fine-tuning testing-set embeddings directory. Run this script to obtain a report for the fine-tuning experiment (a JSONL conversion sketch follows the list).
  9. Restore the variables changed in step 8 and call step12_GnenerateQueryCsv.py to generate an analysis report based on the rule-based model.
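The sketch below illustrates step 3's CDF plot for choosing min_len; the input dump and its one-length-per-line format are hypothetical, since step2_get_min_len.py produces the real statistics.

    # Step 3 sketch: plot the CDF of extracted-string lengths to pick min_len.
    # "string_lengths.txt" (one length per line) is a hypothetical dump.
    import numpy as np
    import matplotlib.pyplot as plt

    lengths = np.loadtxt("string_lengths.txt", dtype=int)
    xs = np.sort(lengths)
    cdf = np.arange(1, len(xs) + 1) / len(xs)  # empirical CDF

    plt.plot(xs, cdf)
    plt.xlabel("string length")
    plt.ylabel("fraction of strings")
    plt.savefig("length_cdf.png")
    # Read the knee of the curve as min_len and set it in utils/configs.py.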
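For step 5, here is a minimal sketch of a JSON-backed embedding cache, assuming the OpenAI v1 Python SDK; the model name and cache path are illustrative, not the repo's configuration.

    # Step 5 sketch: cache embedding vectors in JSON so tokens are spent only once.
    # Model name and cache path are illustrative; see step6_save_embedding_to_json.py.
    import json
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    CACHE = Path("embedding_cache.json")
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}

    def embed(text: str) -> list[float]:
        if text not in cache:  # only call the API for unseen strings
            resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
            cache[text] = resp.data[0].embedding
            CACHE.write_text(json.dumps(cache))
        return cache[text]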
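For steps 6 and 7, a minimal sketch of building a Milvus collection and running the default top-10 search with pymilvus; the schema, field names, metric, and the 1536-dimensional vectors (OpenAI ada-002 size) are assumptions, not the repo's actual settings.

    # Steps 6-7 sketch: build a Milvus collection and run a top-10 query.
    # Schema and parameters are assumed; see step8_generate_db.py / step9_query.py.
    from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                          connections)

    connections.connect(host="localhost", port="19530")  # Milvus runs on Linux only

    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("family", DataType.VARCHAR, max_length=64),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
    ]
    col = Collection("malware_strings", CollectionSchema(fields))

    families = ["family_a", "family_b"]          # placeholder labels
    vectors = [[0.0] * 1536, [0.1] * 1536]       # placeholder embeddings
    col.insert([families, vectors])              # columns follow the schema order
    col.create_index("embedding", {"index_type": "IVF_FLAT", "metric_type": "L2",
                                   "params": {"nlist": 128}})
    col.load()

    query_vector = [0.05] * 1536                 # placeholder testing-set embedding
    hits = col.search([query_vector], "embedding",
                      {"metric_type": "L2", "params": {"nprobe": 16}},
                      limit=10, output_fields=["family"])  # top-10 default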
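For step 8, a minimal sketch of producing the chat-format JSONL training file that the OpenAI fine-tuning endpoint expects; the prompt wording and sample data are illustrative, and step10_addtional_transfer_json_file.py holds the real conversion.

    # Step 8 sketch: write an OpenAI fine-tuning training file (chat JSONL format).
    # The (strings, family) pairs and prompt wording are illustrative.
    import json

    samples = [
        (["CreateRemoteThread", "cmd.exe /c"], "family_a"),  # hypothetical pair
    ]

    with open("fine_tune_train.jsonl", "w") as fh:
        for strings, family in samples:
            record = {"messages": [
                {"role": "system",
                 "content": "Classify the malware family from its strings."},
                {"role": "user", "content": "\n".join(strings)},
                {"role": "assistant", "content": family},
            ]}
            fh.write(json.dumps(record) + "\n")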

RQ Experiment

  1. Dynamic experiment

    In the Data Collection step, the string source is no longer FLOSS but Falcon Sandbox.

  2. Length experiment

    Skip step 3 of the Main Experiment.

  3. K-means experiment

    Further process the testing-set embedding cache generated in step 5 of the Main Experiment with a script such as test/experiment1_k_means.py (see the sketch after this list).

  4. Fine-tuning experiment

    Proceed as in the Main Experiment.
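For the K-means experiment, a minimal sketch of clustering the cached testing-set embeddings with scikit-learn; the cache layout and the cluster count are assumptions, and test/experiment1_k_means.py performs the actual processing.

    # K-means sketch: cluster the cached testing-set embeddings.
    # Cache layout ({"sample_hash": [floats], ...}) and k=10 are assumptions.
    import json

    import numpy as np
    from sklearn.cluster import KMeans

    with open("testing_set_embeddings.json") as fh:  # hypothetical cache path
        cache = json.load(fh)

    hashes = list(cache)
    X = np.array([cache[h] for h in hashes])

    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    for h, label in zip(hashes, km.labels_):
        print(h, label)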

Note

Some scripts only work on Linux:

  step8_generate_db.py
  download_and_extract.py
  step9_query.py

Results

The results of the experiments, including query reports and any additional metrics, will be stored in the data/results/ directory. Each experiment's results will be organized in a separate subdirectory.

Dependencies

To install the required dependencies, run the following command:

    pip install -r src/requirements.txt
