Decision Tree-Based Signature Generation Framework for IoT Malware Detection

For my MSc dissertation, I developed a framework that uses Data Mining and Machine Learning techniques to generate YARA signatures for detecting IoT malware. Chapter 3 of my dissertation covers the design in more depth, but the main highlights are below.

Sample Collection Phase

Scripts used in this phase: csv_to_json_convertor.py.

As ELF executable files are typically found on IoT devices, samples of these were used.
- Benign executables were harvested from freely available firmware downloads for IoT devices (e.g., IP cameras) using BinWalk.
- Malware executables were sourced from researchers at Yokohama National University, Japan - they are available (by request) from here.
All samples were verified as malware or benign using VirusTotal's API v2 and Didier Stevens' VirusTotal Search script.
For the malware samples, AVClass2 was used to identify which family each sample came from.
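
As a small illustration of restricting a sample set to ELF executables, the sketch below walks a directory (for instance, one produced by extracting a firmware image) and keeps only files that begin with the ELF magic number. It is not taken from the project's scripts: the function names `is_elf` and `find_elf_files` are hypothetical, and the check only inspects the first four bytes, so it does not distinguish executables from shared objects.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # the first four bytes of every ELF file


def is_elf(path):
    """Return True if the file at `path` starts with the ELF magic number."""
    with open(path, "rb") as handle:
        return handle.read(4) == ELF_MAGIC


def find_elf_files(root):
    """Recursively collect regular files under `root` that look like ELF binaries."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_elf(p)
    )
```

Calling `find_elf_files("extracted_firmware/")` (a hypothetical directory name) would return a sorted list of `Path` objects for candidate samples.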

Malware Analysis Phase

Scripts used in this phase: dataset_generator.py.

Dynamic Analysis was used to elicit behavioural features (system calls) from all of the IoT samples.
- The samples were run, one by one, in an air-gapped sandbox designed for analysing malware that targets architectures commonly found in IoT devices (e.g., MIPS, ARM, Aarch64). As some malware requires an internet connection as part of its functionality, iNetSim was used to simulate common internet services (e.g., HTTP/HTTPS, DNS, FTP).
- The sandbox used strace to capture the system calls made by each sample. These System Call Trace Logs were saved to files and used as input to the next phase.
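
To make the shape of this data concrete, here is a minimal sketch of reducing strace output to an ordered list of system call names. It is an assumption about how such logs could be processed, not the logic of dataset_generator.py: the function name `extract_syscalls` and the sample log lines are made up for illustration, and the pattern assumes strace's default line layout (optionally prefixed by `[pid N]` or a bare PID) without timestamps.

```python
import re

# Matches an optional "[pid 123] " or "123 " prefix, then the syscall name
# immediately followed by "(". Signal lines ("--- SIGCHLD ... ---") and exit
# lines ("+++ exited with 0 +++") do not match and are skipped.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def extract_syscalls(lines):
    """Return the system call names found in strace output, in order."""
    names = []
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


# Illustrative log lines only; not captured from a real sample.
sample_log = """\
execve("./sample", ["./sample"], 0x7ffd) = 0
[pid  312] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
[pid  312] connect(3, {sa_family=AF_INET}, 16) = -1 ENETUNREACH (Network is unreachable)
--- SIGCHLD {si_signo=SIGCHLD} ---
+++ exited with 0 +++
"""

syscalls = extract_syscalls(sample_log.splitlines())
```

Joining the resulting names with spaces gives one text "document" per sample, which is a convenient input for the N-gram step in the next phase.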

Data Mining Phase

Scripts used in this phase: feature_extraction_selection.py and malware_classification.py.

N-grams were used for Feature Extraction, while TF-IDF was used for Feature Selection. These methods were necessary because the System Call Trace Logs produced in the previous phase can be quite extensive and can include a lot of redundant data that adds no value.
- N-grams were used to elicit features with context. For example, a trigram (n=3) captures a brief sequence of System Calls (i.e., what came before and after a single System Call).
- TF-IDF aids dimensionality reduction of these N-gram features by assigning each of them a weight that reflects its importance across the whole dataset.
- The feature extraction text module provided by Scikit-Learn was used to perform feature extraction and feature selection.
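
The combination of N-grams and TF-IDF described above can be sketched with Scikit-Learn's `TfidfVectorizer`. This is an illustrative example under stated assumptions rather than the contents of feature_extraction_selection.py: the helper name `build_tfidf_features`, the toy traces, and the parameter values are hypothetical. Note that `max_features` keeps the N-grams with the highest term frequency across the corpus, which may differ from how the dissertation selects features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_tfidf_features(traces, n=3, max_features=None):
    """Turn space-separated system call traces into TF-IDF weighted n-grams.

    `traces` is a list of strings, one per sample.
    """
    vectorizer = TfidfVectorizer(
        ngram_range=(n, n),         # only n-grams of exactly size n
        token_pattern=r"\S+",       # treat each whitespace-separated name as a token
        max_features=max_features,  # cap the number of n-gram features
    )
    matrix = vectorizer.fit_transform(traces)
    return vectorizer, matrix


# Toy traces for illustration only.
traces = [
    "open read close open read close",
    "socket connect write socket connect write",
    "open read close socket connect write",
]

vectorizer, weights = build_tfidf_features(traces, n=3, max_features=5)
```

Each row of `weights` corresponds to one sample and each column to one trigram listed in `vectorizer.vocabulary_`.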
Decision Trees were used for malware classification.
- The decision tree module provided by Scikit-Learn was used because, in addition to providing classification models, Scikit-Learn has a rich API for classification metrics. It also integrates easily with other open-source packages (e.g., Graphviz, which was used for plotting the decision trees).
The outputs of this phase were a set of classification results (e.g., Accuracy, Recall, Precision and a Confusion Matrix graphic), a Decision Tree graphic (see example below) and a set of Decision Tree rules.
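
A minimal sketch of this classification step, assuming Scikit-Learn's `DecisionTreeClassifier`, its metrics functions, and `export_text` for the textual rules, is shown below. The feature names, weights, and labels are toy values invented for illustration, and the metrics are computed on the training data only to keep the example short; a real experiment would evaluate on held-out samples. It is not the code from malware_classification.py.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy TF-IDF weights for two trigram features (illustrative values only).
feature_names = ["open read close", "socket connect write"]
X = [[0.9, 0.0], [0.8, 0.1], [0.0, 0.9], [0.1, 0.8]]
y = [0, 0, 1, 1]  # 0 = benign, 1 = malware

# max_depth limits how many conditions a single rule can contain.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)

accuracy = accuracy_score(y, predictions)
precision = precision_score(y, predictions)
recall = recall_score(y, predictions)
matrix = confusion_matrix(y, predictions)

# Text form of the tree: nested threshold tests ending in a class label.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

The printed rules are nested threshold tests on feature weights, which is the kind of 'If...Else' structure that the Malware Detection Phase builds on.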

Malware Detection Phase

Scripts used in this phase: signature_generator.py and malware_detection.py.

This phase comprised two distinct functions, namely Signature Generation and Malware Detection.
- Regarding Signature Generation, the Decision Tree rules (from the classification step of the Data Mining Phase) were a series of 'If...Else' statements (shown below) that formed the basis for the automatic generation of YARA signatures (shown below).
- Regarding Malware Detection, the system call datasets generated in the Malware Analysis Phase were used as input to a Malware Detection module (developed in Python). This module used the YARA API first to compile the signatures, confirming that they were syntactically correct, and then to classify each of the system call datasets as benign or malware. An example of the output from this phase is below:
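
As a separate illustration (not the output of the author's module), the Signature Generation step might be sketched as follows. The mapping used here is an assumption: each condition on a decision tree path is reduced to "this N-gram must be present" or "this N-gram must be absent", whereas the actual rules in signature_generator.py may encode thresholds differently. The function names `escape_yara_string` and `build_yara_rule` and the rule name are hypothetical.

```python
def escape_yara_string(text):
    """Escape backslashes and double quotes for a YARA text string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_yara_rule(rule_name, conditions):
    """Build YARA rule text from one decision tree path.

    `conditions` is a list of (ngram, must_be_present) tuples.
    """
    if not conditions:
        raise ValueError("a rule needs at least one condition")

    string_lines = []
    terms = []
    for index, (ngram, must_be_present) in enumerate(conditions):
        identifier = f"$s{index}"
        string_lines.append(
            f'        {identifier} = "{escape_yara_string(ngram)}"'
        )
        terms.append(identifier if must_be_present else f"not {identifier}")

    lines = [f"rule {rule_name}", "{", "    strings:"]
    lines += string_lines
    lines += ["    condition:", "        " + " and ".join(terms), "}"]
    return "\n".join(lines)


rule_text = build_yara_rule(
    "iot_malware_path_0",
    [("socket connect write", True), ("open read close", False)],
)
print(rule_text)
```

With the yara-python package installed, text like this can be checked for syntax errors with `yara.compile(source=rule_text)`, and the compiled rules can be applied to a dataset file with `rules.match(filepath=...)`. Matching in this way assumes the dataset stores system calls as text separated by single spaces.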

The following shows the workflow (data flow, inputs and outputs) for all of the phases (including the experimental phase):

As part of the project, I ran 200+ experiments (automated using a shell script and a CSV file of parameters) in order to answer the following research questions:

What combinations of N-gram size, number of features and decision tree depth result in the best classification performance?
What combinations of N-gram size, number of features and decision tree depth result in a high detection rate for IoT malware?
To what extent is classification performance an indicator of the detection rate for IoT malware?
What is the minimum number of System Call-based signatures which can achieve a high detection rate for IoT malware?

The answers to these questions can be found in Chapter 5 of my dissertation.

fywalsh/signature-generation-iot-malware-detection