This repo contains the experimental code for the paper: “Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG”
```
src/
│  requirements.txt
│
├─experiment/
│  ├─CollectingDataset/
│  │      download_and_extract.py                *Download and extract raw malware files*
│  │
│  └─MainExperiment/                             *Scripts for the main experiments; different experimental configurations mostly amount to changing dataset settings*
│         step10_addtional_transfer_json_file.py *Generate the fine-tuning training set data*
│         step10_fine_tuning.py                  *Run the fine-tuning*
│         step11_sample_test.py                  *Evaluate testing samples against the fine-tuned model*
│         step12_GnenerateQueryCsv.py            *Generate the rule-based analysis report*
│         step1_addtional_get_black_list.py      *Generate the blacklist JSON*
│         step1_build_db.py                      *Generate the initial version of the string database*
│         step2_get_min_len.py                   *Determine the minimum meaningful string length with ChatGPT*
│         step3_del_min_len.py                   *Filter the string database by the minimum string length*
│         step4_filter_character.py              *Remove recursive strings*
│         step5_filter.py                        *Drop records by frequency ranking to keep each table within 10,000 records*
│         step6_save_embedding_to_json.py        *Generate training-set vectors*
│         step7_get_vector_from_sample.py        *Generate testing-set vectors*
│         step8_generate_db.py                   *Generate the vector database*
│         step9_query.py                         *Generate query results*
│
├─test/                                          *Standalone experiment scripts*
│      experiment1_filter_dataset.py
│      experiment1_k_means.py
│
├─tool/                                          *External tools*
│  └─Floss/                                      *Static extraction of strings from malware*
│         floss
│
└─utils/                                         *Configuration and shared utilities*
   │  configs.py
   │  db_functions.py
   │  gpt_embedding.py
   │  init_configs.py
   │
   └─chatgpt/                                    *LLM functionality support*
          chatgpt.py
          configs.py
          prompts.py
          util.py
          __init__.py
```
- Run the `download_and_extract.py` script to collect malware from the MalwareBazaar website. The malware comes as ZIP archives, and the password for all of them is `infected`. The collected samples are processed with `floss` and the results saved to the `data` directory. The original malware ZIP files are stored in the `data/dataset/rawfile` directory, and the strings extracted by `floss` are saved in the `data/dataset/floss_result` directory. The malware is categorized by the year it was recorded, distinguishing samples recorded in 2024 from those recorded in or before 2023. (A sketch of the extraction step appears after this list.)

  ```
  python src/download_and_extract.py
  ```

  The `data/Floss_result` directory also contains the hash lists for our test files, corresponding to the Training (Vector DB), Testing Set (2024 samples), and Fine-tuning Training Set described in the paper.
- Generate the vector database and run the main experiments described in the paper by executing the scripts in the `MainExperiment` folder step by step.
- First, call `step1_build_db.py` to generate the initial version of the string database, then call `step1_addtional_get_black_list.py` to generate the blacklist JSON.
- Next, use `step2_get_min_len.py` to collect string-length information. After plotting the CDF, we determine the minimum meaningful length and call `step3_del_min_len.py` to filter the string database by it. The minimum length is controlled by the `min_len` variable in `utils/configs.py`. (A CDF sketch appears after this list.)
- Then use `step4_filter_character.py` to remove the recursive strings that appear in the FLOSS extraction. If the string database is still unbalanced after the previous filtering, call `step5_filter.py` to forcibly cap each table at 10,000 records, ranked by string frequency. (See the frequency-cap sketch after this list.)
- Call `step6_save_embedding_to_json.py` and `step7_get_vector_from_sample.py` to generate cache files for the embedding vectors of the training set and the testing set, respectively. Caching ensures that tokens are not consumed repeatedly for the same strings. (See the embedding-cache sketch after this list.)
- Call `step8_generate_db.py` to generate the vector database. Note that the Milvus database we use only runs on Linux.
- Call `step9_query.py` to query the testing-set embeddings generated in step 5 against the Milvus database. The default query returns the top 10 matches. (See the Milvus sketch after this list.)
- If no fine-tuning experiment is required, skip this step. First call `step7_get_vector_from_sample.py` repeatedly, redirecting `TestingSetFilePath`, to generate the training-set and testing-set embeddings for fine-tuning. Call `step10_addtional_transfer_json_file.py` to convert the fine-tuning training-set embeddings into the training file for the OpenAI fine-tuning model, then use `step10_fine_tuning.py` to run the fine-tuning. When training succeeds, OpenAI sends a notification email to the address associated with your account; it contains the fine-tuned model ID. In `step11_sample_test.py`, set the variable `FineTuningModelId` to that ID and point `TestingSetFilePath` at your fine-tuning testing-set embeddings directory. Run the script to obtain a report for the fine-tuning experiment. (A fine-tuning sketch follows this list.)
- Restore the `TestingSetFilePath` change from step 8 and call `step12_GnenerateQueryCsv.py` to generate an analysis report based on the rule-based model.
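A minimal sketch of the collection step, assuming ZipCrypto-protected archives (AES-encrypted ZIPs would need `pyzipper`) and a `floss` binary on `PATH`; the paths and file layout here are illustrative, not `download_and_extract.py`'s actual interface:

```python
import subprocess
import zipfile
from pathlib import Path

RAW_DIR = Path("data/dataset/rawfile")       # downloaded MalwareBazaar archives
OUT_DIR = Path("data/dataset/floss_result")  # per-sample FLOSS output
ZIP_PASSWORD = b"infected"                   # shared password for all archives

for zip_path in RAW_DIR.glob("*.zip"):
    sample_dir = OUT_DIR / zip_path.stem
    sample_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(sample_dir, pwd=ZIP_PASSWORD)
    for sample in sample_dir.iterdir():
        if sample.suffix == ".txt":
            continue  # skip FLOSS output left over from a previous run
        # FLOSS prints the extracted strings to stdout; keep them next to the sample.
        result = subprocess.run(["floss", str(sample)],
                                capture_output=True, text=True)
        (sample_dir / f"{sample.name}.txt").write_text(result.stdout)
```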
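For the length step, a sketch of plotting the CDF of string lengths to pick `min_len`; the JSON layout of the length data is an assumption, not `step2_get_min_len.py`'s real output format:

```python
import json
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical cache: a flat JSON list of string lengths.
with open("data/string_lengths.json") as f:
    lengths = np.sort(np.array(json.load(f)))

cdf = np.arange(1, len(lengths) + 1) / len(lengths)

plt.plot(lengths, cdf)
plt.xlabel("string length")
plt.ylabel("cumulative fraction of strings")
plt.show()

# The knee read off this plot becomes min_len in utils/configs.py.
```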
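The cap enforced by `step5_filter.py` amounts to keeping the 10,000 most frequent strings per family table; a sketch in plain Python (the per-table input format is assumed):

```python
from collections import Counter

MAX_RECORDS = 10_000  # per-table cap used in step5_filter.py

def cap_table(strings: list[str]) -> list[str]:
    """Keep at most MAX_RECORDS distinct strings, ranked by frequency."""
    return [s for s, _ in Counter(strings).most_common(MAX_RECORDS)]
```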
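Steps 6 and 7 cache embeddings so that tokens are spent only once per string; a minimal sketch with the `openai` client, where the model name and cache path are assumptions:

```python
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CACHE_PATH = Path("data/embedding_cache.json")  # hypothetical cache location
cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def embed(text: str) -> list[float]:
    """Return the embedding for text, calling the API only on a cache miss."""
    if text not in cache:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
        cache[text] = resp.data[0].embedding
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[text]
```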
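Steps 8 and 9 build and query the Milvus collection; a sketch with `pymilvus`, where the collection name, field names, and 1536-dimensional vectors are assumptions:

```python
from pymilvus import (Collection, CollectionSchema, DataType,
                      FieldSchema, connections)

connections.connect(host="localhost", port="19530")  # Milvus only runs on Linux

# Hypothetical schema: auto-generated id, family label, embedding vector.
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("family", DataType.VARCHAR, max_length=64),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
])
col = Collection("malware_strings", schema)
col.create_index("embedding", {"index_type": "IVF_FLAT",
                               "metric_type": "L2",
                               "params": {"nlist": 128}})

# Insert the cached training-set vectors (placeholder values shown here).
col.insert([["familyA"], [[0.0] * 1536]])
col.load()

# Top-10 query, matching step9_query.py's default.
hits = col.search(data=[[0.0] * 1536], anns_field="embedding",
                  param={"metric_type": "L2", "params": {"nprobe": 16}},
                  limit=10, output_fields=["family"])
```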
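For the fine-tuning step, a sketch of uploading the JSONL file produced by `step10_addtional_transfer_json_file.py` and launching the job; the file path and base model are assumptions:

```python
from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file produced by step10_addtional_transfer_json_file.py.
training_file = client.files.create(
    file=open("data/fine_tune_train.jsonl", "rb"),  # hypothetical path
    purpose="fine-tune",
)

# Launch the job; on completion OpenAI emails the fine-tuned model id,
# which step11_sample_test.py expects in FineTuningModelId.
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
print(job.id)
```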
- **Dynamic experiment**: the source in the data collection is no longer FLOSS but the Falcon Sandbox.
- **Length experiment**: skip step 3 of the main experiment.
- **K-means experiment**: further process the cached testing-set embedding vectors generated in step 5 of the main experiment with a script (see the K-means sketch after this list).
- **Fine-tuning experiment**: proceed as in the main experiment.
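A sketch of the K-means post-processing on the cached testing-set embeddings, assuming the cache is a JSON mapping from sample hash to vector and `k = 8`; both are illustrative, not what `experiment1_k_means.py` actually does:

```python
import json
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical cache layout: {"<sample hash>": [<embedding floats>], ...}
with open("data/testing_embeddings.json") as f:
    cache = json.load(f)

hashes = list(cache)
X = np.array([cache[h] for h in hashes])

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
for h, label in zip(hashes, kmeans.labels_):
    print(h, label)
```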
Note: some scripts only work on Linux:
- `step8_generate_db.py`
- `download_and_extract.py`
- `step9_query.py`
The results of the experiments, including query reports and any additional metrics, are stored in the `data/results/` directory; each experiment's results are organized in a separate subdirectory.
To install the required dependencies, run:

```
pip install -r requirements.txt
```