Android Malware Classifier

A Graph Convolutional Network (GCN) classifier that identifies Android malware families from APK call graphs. The model achieves ~72% top-1 accuracy across 70 malware families using a bag-of-characters encoding of function names as node features.

How it works

Call graph extraction — each APK is analyzed with Androguard, producing a GML file where nodes are methods/functions and edges are calls between them.
Feature encoding — each node's label (a method signature string) is encoded as a 52-dimensional bag-of-characters vector (26 lowercase + 26 uppercase letter frequencies).
GCN classification — a two-layer Graph Convolutional Network reads the graph, aggregates node embeddings via mean pooling, and outputs a log-softmax distribution over malware families.
Training — 3-fold stratified cross-validation during training; 5-fold evaluation at test time.
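As a rough illustration of the encoding and classification steps above, the sketch below turns method signatures into 52-dimensional letter counts and runs a two-layer graph convolution with mean pooling and log-softmax in plain NumPy. It is not the notebook's code. The function names, the use of raw counts rather than normalized frequencies, the symmetric normalization from Kipf & Welling, the undirected toy graph, and the random weights are all assumptions made for the example.

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # 26 + 26 = 52 symbols
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_label(label):
    """Bag-of-characters vector: raw count of each ASCII letter (non-letters ignored)."""
    vec = np.zeros(len(ALPHABET))
    for ch in label:
        if ch in INDEX:
            vec[INDEX[ch]] += 1.0
    return vec

def normalize_adjacency(adj):
    """Compute D^-1/2 (A + I) D^-1/2 for a symmetric adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degrees are >= 1 after self-loops
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_log_probs(adj, feats, w1, w2):
    """Two graph-convolution layers, mean pooling over nodes, then log-softmax."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ feats @ w1, 0.0)  # layer 1 with ReLU
    scores = a_norm @ hidden @ w2                  # layer 2: one column per family
    pooled = scores.mean(axis=0)                   # mean pooling gives one vector per graph
    shifted = pooled - pooled.max()                # subtract max for numerical stability
    return shifted - np.log(np.exp(shifted).sum())

# Toy call graph with three methods; node 0 calls nodes 1 and 2 (treated as undirected).
labels = [
    "Lcom/example/Main;->onCreate(Landroid/os/Bundle;)V",
    "Lcom/example/Net;->send()V",
    "Ljava/lang/String;->length()I",
]
feats = np.stack([encode_label(s) for s in labels])
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(52, 16))  # hypothetical hidden size of 16
w2 = rng.normal(size=(16, 4))   # hypothetical 4 families instead of 70
log_probs = gcn_log_probs(adj, feats, w1, w2)
```

In a trained model the weights come from optimizing NLLLoss on the log-probabilities; here they are random, so only the shapes and the data flow are meaningful.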

Repository contents

File	Description
`My First Malware.ipynb`	Main Jupyter notebook — data loading, model definition, training, and evaluation
`generateCG.sh`	Bash script that batch-generates GML call graphs from a directory of APK files using Androguard

Prerequisites

Python 3.7 (the notebook kernel is pinned to 3.7.2; newer Python may work but is untested)
CUDA-capable GPU recommended (the notebook uses the default CUDA device, cuda:0, when one is available and falls back to CPU otherwise)
Linux/macOS for generateCG.sh (uses find and Bash background jobs for concurrency)

Python dependencies

torch          # PyTorch
dgl            # Deep Graph Library
networkx
scikit-learn
matplotlib
numpy
androguard     # Only needed for call graph generation, not model training

Install with pip:

pip install torch dgl networkx scikit-learn matplotlib numpy androguard

DGL version note: The notebook's message passing uses fn.copy_u / fn.sum (fn.copy_src was renamed to fn.copy_u, and the old name was removed in DGL 2.x). On a recent DGL (≥ 1.0), other legacy calls used here, such as the dgl.DGLGraph() constructor and from_networkx, may also need updating. Alternatively, replace the hand-written GCN layer with the built-in dgl.nn.GraphConv.
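For readers unfamiliar with DGL's built-in functions: using fn.copy_u as the message function together with fn.sum as the reducer makes every node sum the feature vectors of the nodes that have an edge pointing to it. The plain-Python sketch below reproduces that behaviour on a toy graph. The function name copy_u_sum and the edge-list representation are invented for the example, and DGL itself is not used.

```python
def copy_u_sum(edges, feats):
    """For each node v, sum the feature vectors of all sources u with an edge u -> v.

    edges: list of (u, v) integer pairs; feats: one feature list per node.
    Nodes without incoming edges receive a zero vector.
    """
    dim = len(feats[0])
    out = [[0.0] * dim for _ in feats]
    for u, v in edges:
        for k in range(dim):
            out[v][k] += feats[u][k]  # the "message" is just a copy of u's features
    return out

# Toy call graph: methods 0 and 1 both call method 2, and method 2 calls method 0.
edges = [(0, 2), (1, 2), (2, 0)]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
aggregated = copy_u_sum(edges, feats)
```

A GCN layer typically applies a learned linear map and a nonlinearity to the aggregated features afterwards.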

Data preparation

Required directory layout

The notebook looks for GML files one level up from the repo root:

../amd_data/
├── <FamilyName>/
│   ├── variety<N>/
│   │   ├── <apk_sha256>.gml
│   │   └── ...
│   └── ...
└── ...

Example:

../amd_data/
├── Airpush/
│   ├── variety0/
│   │   ├── abc123.gml
│   │   └── def456.gml
│   ├── variety1/
│   │   └── ...
├── AndroRAT/
│   └── ...

The path structure encodes the label: the first path component after amd_data/ is the malware family name (used as the predicted class), and the second must be named variety<N> where N is a non-negative integer — this matches the on-disk layout of the AMD dataset from Argus Lab. The notebook's label-parsing regex requires this exact naming scheme.
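The notebook's actual regex is not reproduced in this README. As a hedged illustration of the naming rule, the sketch below uses a hypothetical pattern (LABEL_RE) and helper (parse_label) that accept exactly the layout described above and reject anything else.

```python
import re

# Hypothetical pattern: .../amd_data/<family>/variety<N>/<sample>.gml
LABEL_RE = re.compile(
    r"amd_data/(?P<family>[^/]+)/variety(?P<variety>\d+)/[^/]+\.gml$"
)

def parse_label(path):
    """Return (family, variety_number) for a GML path, or raise ValueError."""
    normalized = str(path).replace("\\", "/")  # tolerate Windows-style separators
    match = LABEL_RE.search(normalized)
    if match is None:
        raise ValueError(f"path does not match <family>/variety<N>/<file>.gml: {path}")
    return match["family"], int(match["variety"])

family, variety = parse_label("../amd_data/Airpush/variety0/abc123.gml")
```

A path such as ../amd_data/Airpush/abc123.gml, which lacks the variety<N> level, raises ValueError under this pattern.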

Step 1 — Obtain APKs

The notebook was developed against the Android Malware Dataset (AMD) from Argus Lab, which organizes samples in exactly the directory hierarchy above. Several sources provide compatible collections:

Dataset	URL	Notes
AMD (Argus Lab)	http://amd.arguslab.org	24,553 samples across 135 varieties of 71 families; the closest match to this notebook's expected layout
Drebin	https://drebin.mlsec.org	5,560 samples from 179 families; widely used benchmark; manual directory structuring required
AndroZoo	https://androzoo.uni.lu	Large-scale repository (millions of APKs); requires registration; VirusTotal labels available for family assignment
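MalDroid 2020	https://www.unb.ca/cic/datasets/maldroid-2020.html	~17,000 samples labeled by coarse category (adware, banking, SMS malware, riskware, benign) rather than by family; family assignment needs a separate labeling step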
MalGenome	http://malgenome.com	One of the earliest curated datasets (1,260 samples, 49 families)
VirusShare	https://virusshare.com	Large community collection; requires registration; labels come from VirusTotal AV names

Tip: If you use a dataset other than AMD, reorganize the APKs into <family>/variety<N>/<hash>.apk directories before running generateCG.sh (e.g. Airpush/variety0/abc123.apk). The family name is the top-level directory and becomes the predicted label; the variety<N> subdirectory name is required by the notebook's label-parsing regex. AV scan results (e.g., from VirusTotal), optionally normalized into family names with a tool such as AVClass, are the standard way to assign family labels.
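A minimal sketch of that reorganization step, assuming a flat directory of <hash>.apk files and a dictionary that maps each hash to a family name. The helper name reorganize, the family_of mapping, and the choice to place everything under variety0 are assumptions for illustration; they are not part of this repository.

```python
import shutil
from pathlib import Path

def reorganize(apk_dir, out_root, family_of, variety=0):
    """Copy <apk_dir>/<hash>.apk to <out_root>/<family>/variety<N>/<hash>.apk.

    family_of maps a file stem (e.g. a SHA-256) to a family name. APKs without
    an entry are skipped, and their file names are returned for later labeling.
    """
    unlabeled = []
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        family = family_of.get(apk.stem)
        if family is None:
            unlabeled.append(apk.name)
            continue
        target_dir = Path(out_root) / family / f"variety{variety}"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(apk, target_dir / apk.name)  # copy, so the source stays intact
    return unlabeled
```

For example, reorganize("downloads", "../amd_data", {"abc123": "Airpush"}) would create ../amd_data/Airpush/variety0/abc123.apk and return the names of any APKs that had no label.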

Step 2 — Generate call graphs

Run generateCG.sh pointing at your APK root:

chmod +x generateCG.sh
./generateCG.sh ../amd_data gml

This recursively finds every *.apk under ../amd_data, runs androguard cg on each, and writes a sibling .gml file. Existing GML files are skipped. Jobs are parallelized with Bash background processes.

For large datasets this step is slow (Androguard can take 10–60 seconds per APK depending on size). Run it overnight or on a machine with many cores.
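The skip-existing and run-in-parallel behaviour described above can be approximated in Python as follows. This is a sketch rather than a port of generateCG.sh: the function names are invented, and the command line assumes that androguard cg accepts an -o/--output option, which is worth confirming with androguard cg --help for the installed version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def pending_jobs(apk_root, ext="gml"):
    """List (apk, output) pairs for APKs that have no sibling call graph yet."""
    jobs = []
    for apk in sorted(Path(apk_root).rglob("*.apk")):
        out = apk.with_suffix("." + ext)
        if not out.exists():  # existing graphs are skipped
            jobs.append((apk, out))
    return jobs

def build_command(apk, out):
    """Command line for one APK (assumed flags; verify against your Androguard)."""
    return ["androguard", "cg", "-o", str(out), str(apk)]

def generate_all(apk_root, ext="gml", workers=4):
    """Run the pending jobs concurrently and return {output_path: exit_code}."""
    jobs = pending_jobs(apk_root, ext)

    def run(job):
        return subprocess.run(build_command(*job)).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run, jobs))
    return {str(out): code for (_, out), code in zip(jobs, codes)}
```

Threads are sufficient here because each job spends its time in an external process; workers would normally be set near the number of CPU cores.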

Running the notebook

Launch Jupyter:

jupyter notebook "My First Malware.ipynb"

Select the DGL kernel (or whichever Python 3.7 environment has the dependencies above).
Run all cells in order. The notebook will:
- Load and split GML files into train (922 samples, up to 15 per family) and test (581 samples, up to 30 per family) sets
- Define the GCN classifier
- Train using 3-fold cross-validation until loss ≤ 0.1
- Print per-class accuracy on the test set
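One way to draw a fixed number of samples per family for training and testing is sketched below. How the notebook actually selects its samples is not documented in this README, so the function name capped_split, the shuffling, and the rule that test samples are taken after the training samples are all assumptions.

```python
import random
from collections import defaultdict

def capped_split(samples, n_train=15, n_test=30, seed=0):
    """Split (path, family) pairs into disjoint train and test lists.

    Each family contributes at most n_train training and n_test test samples;
    families with fewer samples simply contribute what they have.
    """
    by_family = defaultdict(list)
    for path, family in samples:
        by_family[family].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for family in sorted(by_family):
        paths = sorted(by_family[family])
        rng.shuffle(paths)
        train += [(p, family) for p in paths[:n_train]]
        test += [(p, family) for p in paths[n_train:n_train + n_test]]
    return train, test

# Invented toy data: one large family and one small family.
samples = [(f"Airpush/variety0/{i}.gml", "Airpush") for i in range(50)]
samples += [(f"Mtk/variety0/{i}.gml", "Mtk") for i in range(20)]
train, test = capped_split(samples)
```

With these caps the small family supplies 15 training samples but only 5 test samples, which is how totals can fall below the number of families times the cap.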

GPU configuration

The notebook sets use_cuda = True in the second cell and derives a device variable from it:

use_cuda = True
device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")

This selects the default CUDA device (cuda:0 unless configured otherwise) when a GPU is available, and falls back to CPU automatically. To force CPU-only execution, set use_cuda = False in the second cell.

Results

Trained on the AMD dataset with the default hyperparameters:

Overall accuracy: 71.94%
Optimizer: Adam, lr = 1e-6
Loss: NLLLoss
Classes: 70 malware families

Per-family accuracy ranges from 0% (Mtk) to 100% (Aples, Boxer, Cova, Erop, and ~15 others).
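For reference, per-family and overall accuracy can be computed from true and predicted labels as in the sketch below. The helper names and the five toy predictions are invented for the example and are unrelated to the reported results.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each true family."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {family: correct[family] / total[family] for family in sorted(total)}

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted family matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented toy predictions.
y_true = ["Boxer", "Boxer", "Mtk", "Airpush", "Airpush"]
y_pred = ["Boxer", "Boxer", "Airpush", "Airpush", "Boxer"]
by_family = per_class_accuracy(y_true, y_pred)
overall = overall_accuracy(y_true, y_pred)
```

Because families differ in size, the overall figure is a sample-weighted average and can sit well above or below the unweighted mean of the per-family values.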

References

Kipf & Welling, Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)
Li et al., Significant Permission Identification for Machine-Learning-Based Android Malware Detection (IEEE Transactions on Industrial Informatics, 2018)
AMD dataset: Wei et al., Deep Ground Truth Analysis of Current Android Malware (DIMVA 2017)
Androguard documentation
DGL documentation
