bushidocodes/android-malware-classifier

Android Malware Classifier

A Graph Convolutional Network (GCN) classifier that identifies Android malware families from APK call graphs. The model achieves ~72% top-1 accuracy across 70 malware families using a bag-of-characters encoding of function names as node features.

How it works

  1. Call graph extraction — each APK is analyzed with Androguard, producing a GML file where nodes are methods/functions and edges are calls between them.
  2. Feature encoding — each node's label (a method signature string) is encoded as a 52-dimensional bag-of-characters vector (26 lowercase + 26 uppercase letter frequencies).
  3. GCN classification — a two-layer Graph Convolutional Network reads the graph, aggregates node embeddings via mean pooling, and outputs a log-softmax distribution over malware families.
  4. Training — 3-fold stratified cross-validation during training; 5-fold evaluation at test time.
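
The feature encoding in step 2 can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes raw letter counts (the notebook may normalize them), the function name bag_of_characters is hypothetical, and the method signature string is a made-up example.

```python
import string
from typing import List

# 26 lowercase letters followed by 26 uppercase letters -> 52 feature slots.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def bag_of_characters(label: str) -> List[int]:
    """Count how often each ASCII letter occurs in a node label."""
    vector = [0] * len(ALPHABET)
    for ch in label:
        slot = INDEX.get(ch)
        if slot is not None:  # digits and symbols such as '/', ';', '(' are ignored
            vector[slot] += 1
    return vector


# Illustrative label only; real labels come from the Androguard GML output.
features = bag_of_characters("Lcom/example/Main;->onCreate()V")
print(len(features), sum(features))
```

Because the encoding discards character order, two methods whose names are anagrams of each other receive identical feature vectors.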

Repository contents

| File | Description |
| --- | --- |
| My First Malware.ipynb | Main Jupyter notebook: data loading, model definition, training, and evaluation |
| generateCG.sh | Bash script that batch-generates GML call graphs from a directory of APK files using Androguard |

Prerequisites

  • Python 3.7 (the notebook kernel is pinned to 3.7.2; newer Python may work but is untested)
  • CUDA-capable GPU recommended (the notebook uses the default CUDA device when one is available and otherwise falls back to CPU)
  • Linux/macOS for generateCG.sh (uses find and Bash process concurrency)

Python dependencies

torch          # PyTorch
dgl            # Deep Graph Library
networkx
scikit-learn
matplotlib
numpy
androguard     # Only needed for call graph generation, not model training

Install with pip:

pip install torch dgl networkx scikit-learn matplotlib numpy androguard

DGL version note: The notebook's message passing uses fn.copy_u / fn.sum (fn.copy_src was renamed to fn.copy_u and the old name was removed in DGL 2.x). On a recent DGL (≥ 1.0) some other legacy calls used here — e.g. dgl.DGLGraph() and from_networkx — may also need updating, or you can switch the GCN class to the simplified dgl.nn.GraphConv.

Data preparation

Required directory layout

The notebook looks for GML files one level up from the repo root:

../amd_data/
├── <FamilyName>/
│   ├── variety<N>/
│   │   ├── <apk_sha256>.gml
│   │   └── ...
│   └── ...
└── ...

Example:

../amd_data/
├── Airpush/
│   ├── variety0/
│   │   ├── abc123.gml
│   │   └── def456.gml
│   ├── variety1/
│   │   └── ...
├── AndroRAT/
│   └── ...

The path structure encodes the label: the first path component after amd_data/ is the malware family name (used as the predicted class), and the second must be named variety<N> where N is a non-negative integer — this matches the on-disk layout of the AMD dataset from Argus Lab. The notebook's label-parsing regex requires this exact naming scheme.
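
To make the naming constraint concrete, here is a sketch of how a label could be recovered from a path. The pattern below is an assumption written to match the layout described above; it is not copied from the notebook, and LABEL_RE and parse_label are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: .../amd_data/<Family>/variety<N>/<name>.gml
LABEL_RE = re.compile(
    r"amd_data/(?P<family>[^/]+)/variety(?P<variety>\d+)/[^/]+\.gml$"
)


def parse_label(path: str) -> Optional[Tuple[str, int]]:
    """Return (family, variety index), or None if the path does not fit the layout."""
    match = LABEL_RE.search(path.replace("\\", "/"))
    if match is None:
        return None
    return match.group("family"), int(match.group("variety"))


print(parse_label("../amd_data/Airpush/variety0/abc123.gml"))  # ('Airpush', 0)
print(parse_label("../amd_data/Airpush/misc/abc123.gml"))      # None
```

A file placed under any directory name other than variety<N> yields no label under this pattern, which is why the layout has to be followed exactly.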

Step 1 — Obtain APKs

The notebook was developed against the Android Malware Dataset (AMD) from Argus Lab, which organizes samples in exactly the directory hierarchy above. Several sources provide compatible collections:

| Dataset | URL | Notes |
| --- | --- | --- |
| AMD (Argus Lab) | http://amd.arguslab.org | 24,553 samples across 135 varieties of 71 families; the closest match to this notebook's expected layout |
| Drebin | https://drebin.mlsec.org | 5,560 samples from 179 families; widely used benchmark; manual directory structuring required |
| AndroZoo | https://androzoo.uni.lu | Large-scale repository (millions of APKs); requires registration; VirusTotal labels available for family assignment |
| MalDroid 2020 | https://www.unb.ca/cic/datasets/maldroid-2020.html | More than 17,000 samples; labeled by category (adware, banking, SMS malware, riskware, benign) rather than by family |
| MalGenome | http://malgenome.com | One of the earliest curated datasets (1,260 samples, 49 families) |
| VirusShare | https://virusshare.com | Large community collection; requires registration; labels come from VirusTotal AV names |

Tip: If you use a dataset other than AMD, reorganize the APKs into <family>/variety<N>/<hash>.apk directories before running generateCG.sh (e.g. Airpush/variety0/abc123.apk). The family name is the top-level directory and becomes the predicted label; the variety<N> subdirectory name is required by the notebook's label-parsing regex. AV scan results (e.g., from VirusTotal or AVClass) are the standard way to assign family labels.
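
One way to do the reorganization described in the tip is sketched below. It assumes you already have a mapping from APK file stem (for example its hash) to a (family, variety) pair; the function organize_apks, the mapping format, and the choice of variety 0 for datasets without varieties are all illustrative assumptions. The demo runs in a temporary directory with empty placeholder files.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def organize_apks(source_dir: Path, dest_root: Path,
                  labels: Dict[str, Tuple[str, int]]) -> int:
    """Copy each labeled APK to <dest_root>/<family>/variety<N>/<name>.apk.

    APKs whose stem is missing from `labels` are skipped.
    Returns the number of files copied.
    """
    copied = 0
    for apk in sorted(source_dir.glob("*.apk")):
        label = labels.get(apk.stem)
        if label is None:
            continue
        family, variety = label
        target_dir = dest_root / family / f"variety{variety}"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(apk, target_dir / apk.name)
        copied += 1
    return copied


# Demo with placeholder files; real labels would come from AV scan results.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "flat", Path(tmp) / "amd_data"
    src.mkdir()
    (src / "abc123.apk").write_bytes(b"")
    (src / "unlabeled.apk").write_bytes(b"")
    n_copied = organize_apks(src, dst, {"abc123": ("Airpush", 0)})
    created = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*.apk"))

print(n_copied, created)  # 1 ['Airpush/variety0/abc123.apk']
```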

Step 2 — Generate call graphs

Run generateCG.sh pointing at your APK root:

chmod +x generateCG.sh
./generateCG.sh ../amd_data gml

This recursively finds every *.apk under ../amd_data, runs androguard cg on each, and writes a sibling .gml file. Existing GML files are skipped. Jobs are parallelized with Bash background processes.
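
For readers who prefer Python over Bash, the same skip-and-parallelize logic might look like the sketch below. It is not a port of generateCG.sh: the function names are hypothetical, and the `androguard cg -o <output> <apk>` invocation is an assumption that should be checked against `androguard cg --help` for your installed version. The demo only exercises the skip logic and does not call Androguard.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def pending_apks(root: Path) -> List[Path]:
    """Recursively list APKs that do not yet have a sibling .gml file."""
    return sorted(apk for apk in root.rglob("*.apk")
                  if not apk.with_suffix(".gml").exists())


def build_command(apk: Path) -> List[str]:
    # Assumed CLI form; verify with `androguard cg --help`.
    return ["androguard", "cg", "-o", str(apk.with_suffix(".gml")), str(apk)]


def generate_all(root: Path, workers: int = 4) -> None:
    """Run one Androguard process per pending APK, `workers` at a time."""
    def run(apk: Path) -> None:
        subprocess.run(build_command(apk), check=False)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, pending_apks(root)))


# Demo of the skip logic with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    variety = Path(tmp) / "Airpush" / "variety0"
    variety.mkdir(parents=True)
    (variety / "done.apk").write_bytes(b"")
    (variety / "done.gml").write_text("graph []")
    (variety / "todo.apk").write_bytes(b"")
    todo_names = [p.name for p in pending_apks(Path(tmp))]
    example_cmd = build_command(Path("x/todo.apk"))

print(todo_names)  # ['todo.apk']
```

Threads are sufficient here because each worker only waits on an external process.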

For large datasets this step is slow (Androguard can take 10–60 seconds per APK depending on size). Run it overnight or on a machine with many cores.
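
A quick back-of-the-envelope calculation shows what to expect for the full AMD dataset (24,553 APKs). The sketch assumes perfect scaling across workers, which real runs will not quite reach; the worker count of 16 is only an example.

```python
def estimated_hours(n_apks: int, seconds_per_apk: float, workers: int = 1) -> float:
    """Wall-clock hours, assuming perfect parallel scaling across `workers`."""
    return n_apks * seconds_per_apk / workers / 3600


for workers in (1, 16):
    low = estimated_hours(24_553, 10, workers)
    high = estimated_hours(24_553, 60, workers)
    print(f"{workers:>2} workers: {low:.1f} to {high:.1f} hours")
```

Under these assumptions a single process needs roughly 68 to 409 hours, while 16 parallel jobs bring that down to roughly 4 to 26 hours.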

Running the notebook

  1. Launch Jupyter:

    jupyter notebook "My First Malware.ipynb"
  2. Select the DGL kernel (or whichever Python 3.7 environment has the dependencies above).

  3. Run all cells in order. The notebook will:

    • Load the GML files and split them into train (922 samples, at most 15 per family) and test (581 samples, at most 30 per family) sets
    • Define the GCN classifier
    • Train using 3-fold cross-validation until loss ≤ 0.1
    • Print per-class accuracy on the test set
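
The classifier's forward pass (two graph-convolution layers, mean pooling, log-softmax) can be illustrated without PyTorch or DGL. The sketch below is a dependency-free approximation, not the notebook's model: the hidden size of 16, the 4 output classes, the random weights, and the omission of bias terms are all assumptions made to keep it short.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(edges, h, weight):
    """Sum each node's in-neighbor features, then apply a linear map and ReLU."""
    agg = [[0.0] * len(h[0]) for _ in h]
    for src, dst in edges:  # message: copy source features; reduce: sum
        for k, value in enumerate(h[src]):
            agg[dst][k] += value
    return [[max(0.0, v) for v in row] for row in matmul(agg, weight)]


def log_softmax(logits):
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - lse for v in logits]


def classify(edges, features, w1, w2, w_out):
    h = gcn_layer(edges, features, w1)
    h = gcn_layer(edges, h, w2)
    pooled = [sum(col) / len(h) for col in zip(*h)]  # mean over all nodes
    return log_softmax(matmul([pooled], w_out)[0])


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]  # toy call graph: caller -> callee
features = random_matrix(3, 52, rng)      # stand-in for bag-of-characters vectors
w1 = random_matrix(52, 16, rng)
w2 = random_matrix(16, 16, rng)
w_out = random_matrix(16, 4, rng)
log_probs = classify(edges, features, w1, w2, w_out)
print(log_probs)
```

The output is a vector of log-probabilities, one per class, which is the form NLLLoss expects.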

GPU configuration

The notebook sets use_cuda = True in the second cell and derives a device variable from it:

use_cuda = True
device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")

This selects cuda:0 when a GPU is available, or falls back to CPU automatically. To force CPU-only execution, change use_cuda = False in the second cell.

Results

Trained on the AMD dataset with the default hyperparameters:

  • Overall accuracy: 71.94%
  • Optimizer: Adam, lr = 1e-6
  • Loss: NLLLoss
  • Classes: 70 malware families

Per-family accuracy ranges from 0% (Mtk) to 100% (Aples, Boxer, Cova, Erop, and ~15 others).
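
Per-family accuracy is the fraction of a family's test samples that were assigned to that family. A small sketch of the computation follows; the function name accuracy_report is hypothetical, and the labels are toy data that do not reproduce the results above.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def accuracy_report(y_true: Sequence[str],
                    y_pred: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and a per-family accuracy mapping."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_family = {family: correct[family] / n for family, n in total.items()}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_family


# Toy labels for illustration only.
y_true = ["Airpush", "Airpush", "Boxer", "Mtk"]
y_pred = ["Airpush", "Boxer", "Boxer", "Airpush"]
overall, per_family = accuracy_report(y_true, y_pred)
print(overall)     # 0.5
print(per_family)  # {'Airpush': 0.5, 'Boxer': 1.0, 'Mtk': 0.0}
```

Families with very few test samples can swing between 0% and 100% on a handful of predictions, so per-family figures are best read alongside the sample counts.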
