A Graph Convolutional Network (GCN) classifier that identifies Android malware families from APK call graphs. The model achieves ~72% top-1 accuracy across 70 malware families using a bag-of-characters encoding of function names as node features.
- Call graph extraction — each APK is analyzed with Androguard, producing a GML file where nodes are methods/functions and edges are calls between them.
- Feature encoding — each node's label (a method signature string) is encoded as a 52-dimensional bag-of-characters vector (26 lowercase + 26 uppercase letter frequencies).
- GCN classification — a two-layer Graph Convolutional Network reads the graph, aggregates node embeddings via mean pooling, and outputs a log-softmax distribution over malware families.
- Training — 3-fold stratified cross-validation during training; 5-fold evaluation at test time.
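The feature-encoding step can be illustrated with a short, dependency-free sketch. The function name `bag_of_chars`, the `normalize` flag, and the sample signature are assumptions for illustration; the notebook may store raw counts or normalized frequencies, so both are shown.

```python
import string

# 26 lowercase followed by 26 uppercase letters give 52 feature slots.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def bag_of_chars(label, normalize=True):
    """Encode a method-signature string as a 52-dimensional letter histogram."""
    counts = [0.0] * len(ALPHABET)
    for ch in label:
        slot = INDEX.get(ch)
        if slot is not None:  # digits and punctuation such as '/', ';', '(' are ignored
            counts[slot] += 1.0
    total = sum(counts)
    if normalize and total > 0:
        counts = [c / total for c in counts]  # relative frequencies (assumed variant)
    return counts


# Hypothetical node label in the style of a Dalvik method signature.
vec = bag_of_chars("Lcom/example/Foo;->bar()V", normalize=False)
print(len(vec), vec[INDEX["o"]], vec[INDEX["L"]])
```

Because only letter counts are kept, two signatures that are anagrams of each other map to the same vector; the encoding trades precision for a small, fixed-size node feature.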
| File | Description |
|---|---|
| `My First Malware.ipynb` | Main Jupyter notebook: data loading, model definition, training, and evaluation |
| `generateCG.sh` | Bash script that batch-generates GML call graphs from a directory of APK files using Androguard |
- Python 3.7 (the notebook kernel is pinned to 3.7.2; newer Python may work but is untested)
- CUDA-capable GPU recommended (the notebook uses CUDA when available and falls back to CPU otherwise)
- Linux/macOS for `generateCG.sh` (uses `find` and Bash process concurrency)
torch # PyTorch
dgl # Deep Graph Library
networkx
scikit-learn
matplotlib
numpy
androguard # Only needed for call graph generation, not model training
Install with pip:
pip install torch dgl networkx scikit-learn matplotlib numpy androguard
DGL version note: The notebook's message passing uses `fn.copy_u`/`fn.sum` (`fn.copy_src` was renamed to `fn.copy_u`, and the old name was removed in DGL 2.x). On a recent DGL (≥ 1.0), some other legacy calls used here, e.g. `dgl.DGLGraph()` and `from_networkx`, may also need updating, or you can switch the `GCN` class to the simplified `dgl.nn.GraphConv`.
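To make the `copy_u`/`sum` pairing concrete: the message function copies each source node's features along its outgoing edges, and the reduce function sums whatever arrives at each destination node. The sketch below reproduces that aggregation in plain Python without DGL. The function name `copy_u_sum` and the toy graph are hypothetical, and this is not the notebook's code.

```python
def copy_u_sum(num_nodes, edges, feats):
    """For every edge (u, v), copy u's feature vector to v and sum at v."""
    dim = len(feats[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for u, v in edges:
        for k in range(dim):
            out[v][k] += feats[u][k]
    return out


# Toy call graph: method 0 calls methods 1 and 2, method 1 calls method 2.
edges = [(0, 1), (0, 2), (1, 2)]
feats = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]

agg = copy_u_sum(3, edges, feats)
print(agg)  # node 0 has no callers, so its aggregate stays at zero
```

In DGL the same step is typically written as `g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))`, where the field names `'h'` and `'m'` are arbitrary; a GCN layer then applies a linear transform and a nonlinearity to the aggregated features.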
The notebook looks for GML files one level up from the repo root:
../amd_data/
├── <FamilyName>/
│ ├── variety<N>/
│ │ ├── <apk_sha256>.gml
│ │ └── ...
│ └── ...
└── ...
Example:
../amd_data/
├── Airpush/
│ ├── variety0/
│ │ ├── abc123.gml
│ │ └── def456.gml
│ ├── variety1/
│ │ └── ...
├── AndroRAT/
│ └── ...
The path structure encodes the label: the first path component after `amd_data/` is the malware family name (used as the predicted class), and the second must be named `variety<N>`, where N is a non-negative integer. This matches the on-disk layout of the AMD dataset from Argus Lab, and the notebook's label-parsing regex requires this exact naming scheme.
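A label parser for this layout might look like the sketch below. The pattern `LABEL_RE` and the helper `parse_label` are hypothetical stand-ins; the notebook's actual regex is not reproduced here and may differ in detail.

```python
import re

# Assumed layout: <prefix>/amd_data/<Family>/variety<N>/<file>.gml
LABEL_RE = re.compile(
    r"amd_data/(?P<family>[^/]+)/variety(?P<variety>\d+)/[^/]+\.gml$"
)


def parse_label(path):
    """Return (family, variety) for a GML path, or None if the layout does not match."""
    match = LABEL_RE.search(str(path))
    if match is None:
        return None
    return match.group("family"), int(match.group("variety"))


print(parse_label("../amd_data/Airpush/variety0/abc123.gml"))
print(parse_label("../amd_data/Airpush/abc123.gml"))  # missing variety<N> level
```

Returning `None` for non-matching paths makes layout mistakes visible early, instead of silently assigning a wrong family name.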
The notebook was developed against the Android Malware Dataset (AMD) from Argus Lab, which organizes samples in exactly the directory hierarchy above. Several sources provide compatible collections:
| Dataset | URL | Notes |
|---|---|---|
| AMD (Argus Lab) | http://amd.arguslab.org | 24,553 samples across 135 varieties of 71 families; the closest match to this notebook's expected layout |
| Drebin | https://drebin.mlsec.org | 5,560 samples from 179 families; widely used benchmark; manual directory structuring required |
| AndroZoo | https://androzoo.uni.lu | Large-scale repository (millions of APKs); requires registration; VirusTotal labels available for family assignment |
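| MalDroid 2020 | https://www.unb.ca/cic/datasets/maldroid-2020.html | ~17,000 samples; labeled by category (adware, banking, SMS malware, riskware, benign) rather than by family, so family labels must be assigned separately |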
| MalGenome | http://malgenome.com | One of the earliest curated datasets (1,260 samples, 49 families) |
| VirusShare | https://virusshare.com | Large community collection; requires registration; labels come from VirusTotal AV names |
Tip: If you use a dataset other than AMD, reorganize the APKs into a `<family>/variety<N>/<hash>.apk` layout before running `generateCG.sh` (e.g. `Airpush/variety0/abc123.apk`). The family name is the top-level directory and becomes the predicted label; the `variety<N>` subdirectory name is required by the notebook's label-parsing regex. AV scan results (e.g., VirusTotal reports, optionally normalized with AVClass) are the usual source of family labels.
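The reorganization described in the tip could be scripted roughly as follows. This is a sketch under assumptions: the helper `reorganize`, the flat input directory, and the `family_of` mapping (APK hash to family name, which you would derive from your own AV labels) are all hypothetical, and every APK is placed in `variety0` because other datasets usually carry no variety information.

```python
import shutil
import tempfile
from pathlib import Path


def reorganize(apk_dir, out_root, family_of, variety=0):
    """Copy <hash>.apk files into out_root/<family>/variety<N>/<hash>.apk.

    APKs without an entry in family_of are skipped and returned for review.
    """
    unlabeled = []
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        family = family_of.get(apk.stem)
        if family is None:
            unlabeled.append(apk.name)
            continue
        dest = Path(out_root) / family / f"variety{variety}"
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(apk, dest / apk.name)
    return unlabeled


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "flat"
    src.mkdir()
    for name in ("abc123.apk", "def456.apk", "zzz999.apk"):
        (src / name).write_bytes(b"")
    labels = {"abc123": "Airpush", "def456": "AndroRAT"}  # made-up labels
    skipped = reorganize(src, Path(tmp) / "amd_data", labels)
    created = sorted(
        p.relative_to(tmp).as_posix()
        for p in (Path(tmp) / "amd_data").rglob("*.apk")
    )
    print(created, skipped)
```

Copying rather than moving keeps the original collection intact until the call graphs have been generated successfully.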
Run generateCG.sh pointing at your APK root:
chmod +x generateCG.sh
./generateCG.sh ../amd_data gml
This recursively finds every `*.apk` under `../amd_data`, runs `androguard cg` on each, and writes a sibling `.gml` file. Existing GML files are skipped. Jobs are parallelized with Bash background processes.
For large datasets this step is slow (Androguard can take 10–60 seconds per APK depending on size). Run it overnight or on a machine with many cores.
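The skip-if-present behaviour of the script can be mirrored in Python, which is handy for checking how much work remains before a long run. The sketch below is not a port of `generateCG.sh`: the helpers `pending_jobs` and `build_command` are hypothetical, and the `-o` output flag is an assumption about the Androguard CLI that should be checked against `androguard cg --help` for your installed version.

```python
import tempfile
from pathlib import Path


def pending_jobs(apk_root):
    """Yield (apk, gml) pairs for APKs that do not yet have a sibling .gml file."""
    for apk in sorted(Path(apk_root).rglob("*.apk")):
        gml = apk.with_suffix(".gml")
        if not gml.exists():
            yield apk, gml


def build_command(apk, gml):
    """Build the argument list for one call-graph job (assumed invocation)."""
    return ["androguard", "cg", "-o", str(gml), str(apk)]


# Demo: one APK already has a call graph, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    fam = Path(tmp) / "Airpush" / "variety0"
    fam.mkdir(parents=True)
    (fam / "abc123.apk").write_bytes(b"")
    (fam / "def456.apk").write_bytes(b"")
    (fam / "def456.gml").write_text("graph []")  # pretend this one is done
    todo = [(apk.name, gml.name) for apk, gml in pending_jobs(tmp)]
    print(todo)
```

Each command list can be passed to `subprocess.run`, and a `concurrent.futures.ThreadPoolExecutor` gives the same kind of parallelism as the Bash background jobs.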
- Launch Jupyter:
  jupyter notebook "My First Malware.ipynb"
- Select the DGL kernel (or whichever Python 3.7 environment has the dependencies above).
- Run all cells in order. The notebook will:
  - Load and split GML files into train (922 samples, at most 15 per family) and test (581 samples, at most 30 per family) sets
  - Define the GCN classifier
  - Train using 3-fold cross-validation until loss ≤ 0.1
  - Print per-class accuracy on the test set
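The stratified 3-fold idea used during training can be sketched without any dependencies. This is a simplified, unshuffled stand-in for `sklearn.model_selection.StratifiedKFold`, not the notebook's code; the function names and toy labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=3):
    """Assign sample indices to k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        # Deal each class's samples round-robin across the folds.
        for pos, idx in enumerate(by_class[label]):
            folds[pos % k].append(idx)
    return folds


def train_val_splits(labels, k=3):
    """Yield (train_indices, val_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield sorted(train), sorted(folds[held_out])


labels = ["Airpush"] * 6 + ["AndroRAT"] * 3  # toy labels
splits = list(train_val_splits(labels, k=3))
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx)
```

Stratification matters here because family sizes are uneven: without it, a small family could end up entirely in the held-out fold and never be seen during that round of training.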
The notebook sets use_cuda = True in the second cell and derives a device variable from it:
use_cuda = True
device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")
This selects the default CUDA device (`cuda:0` unless changed) when a GPU is available and falls back to CPU automatically otherwise. To force CPU-only execution, set `use_cuda = False` in the second cell.
Trained on the AMD dataset with the default hyperparameters:
- Overall accuracy: 71.94%
- Optimizer: Adam, lr = 1e-6
- Loss: NLLLoss
- Classes: 70 malware families
Per-family accuracy ranges from 0% (Mtk) to 100% (Aples, Boxer, Cova, Erop, and ~15 others).
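Overall and per-family accuracy answer different questions, which is why a family at 0% can coexist with an overall figure near 72%. The sketch below shows how both are computed from true and predicted labels; the helper names and the toy labels are made up for illustration and are not results from the notebook.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples within each true class."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples classified correctly, weighted by class size."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example only.
y_true = ["Boxer", "Boxer", "Mtk", "Mtk", "Airpush"]
y_pred = ["Boxer", "Boxer", "Airpush", "Boxer", "Airpush"]

by_family = per_class_accuracy(y_true, y_pred)
print(by_family)
print(overall_accuracy(y_true, y_pred))
```

Overall accuracy is dominated by the larger families, so reporting the per-family breakdown alongside it gives a fairer picture of how the classifier behaves on rare families.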
- Kipf & Welling, Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)
- Li et al., Significant Permission Identification for Machine-Learning-Based Android Malware Detection (IEEE Transactions on Industrial Informatics, 2018)
- AMD dataset: Wei et al., Deep Ground Truth Analysis of Current Android Malware (DIMVA 2017)
- Androguard documentation
- DGL documentation