This project includes three different subprojects:
- A frequent subgraph mining project based on gSpan
- A discriminative subgraph mining project based on GAIA
- A
pytorchenvironment that encodes features (preprocesses graph data) produced bygSpanandGAIAand trains on processed data
- gSpan setup
- Create a virtual environment
envwithingSpanproject by runningpython3 -m venv env. - Install packages by running
pip install -r requirements.txt - Replace the files
gspan.pyandgraph.pyatenv/lib/python3/site-packages/gspan_mining/with the files provided atgspan/nx_support/
- Create a virtual environment
- GAIA setup
- Download link and instruction of GAIA can be found here
- PyTorch Environment
- Create a virtual environment
envwithinpytorchproject by runningpython3 -m venv env. - Install packages by running
pip install -r requirements.txt
- Create a virtual environment
Since we use TUD Benchmark, we need to convert TUD data to Networkx data. To do so run the following command within the pytorch project.
# python utilities/convert.py -d {dataset_name}
# for example if we want to convert MUTAG to Networkx, run the following
python utilities/convert.py -d MUTAGThis command will convert TUD to Networkx and it will save a {dataset}.dat file under pytorch/data_networkx/{dataset}.dat.
Next, copy the generated {dataset}.dat file to gspan/data_nx/{dataset}/{dataset}.dat. Well-known dataset MUTAG already exists as an example.
Next we have to convert networkx data to a specific format that can be used with gSpan. To do so run the following command within the gSpan project.
# python utilities/nx_to_graph.py -d {dataset_name}
# for example if we want to convert MUTAG to Networkx, run the following
python utilities/nx_to_graph.py -d MUTAGThis will produce a {dataset}.graph file under gspan/data_graph/{dataset}.graph.
Now that we have .graph data we can generate relational features by running the following command within the gspan project.
# python app.py -d {dataset_name} -s {support_in_percentage} -n {length_of_dataset/nr_of_graphs}
# for example if we want to generate subgraphs of MUTAG with 90%, run the following
python app.py -d MUTAG -s 90 -n 188This command should generate 13 subgraphs and save them under /data_nx_features/{dataset_support}.dat.
To generate files that are used as an input for GAIA software, run the following command withing the pytorch environment. This command with automatically convert TUD datasets to a specific format used in GAIA. Two files will be generated {dataset_name}_node_file.txt and {dataset_name}_edge_file.txt.
# python utilities/tud_to_gaia.py -d {dataset_name}
# for example if we want to convert MUTAG to GAIA, run the following
python utilities/tud_to_gaia.py -d MUTAGOnce the files are ready, move them to GAIA directory and follow the instructions on how to use GAIA.
For MUTAG dataset, GAIA should generate six discriminative features. Generated features are saved in a file under the name patternResult.txt. Now, to convert these features to Networkx format move the file under gspan/data_gaia/{dataset_name}/patternResult.txt and run the following command within the gspan environment:
# python gaia_to_nx.py -d {dataset_name} -nf {nr_of_node_features/node_labels}
# for example if for MUTAG, we have 7 node features
python gaia_to_nx.py -d MUTAG -nf 7This command will create a new file under gspan/data_gaia/{dataset_name}/{dataset_name}_gaia_nx.dat.
To make use of feature selection methods, we need to flatten the graphs and their respective features genereated by one of the methods listed above.
Feature Selection make use of target variable of the dataset, so we need to extract that first. To do so, run the following command within the pytorch environment:
# python utilities/target_extraction.py -d {dataset_name}
# for example utilities/target_extraction.py -d MUTAG
python utilities/target_extraction.py -d {dataset_name}This command will extract the target value and save it in a file under data_networkx/{dataset_name}_target.dat. Now the feature selection methods exist within the gspan environment. To make use of the feature selection method, first create a playground, for example: gspan/mutag_playground. In this directory, add the following files: MUTAG.dat which can be found under the gspan/data_nx/MUTAG/MUTAG.dat and the generated file that contains the target of the dataset at the pytorch/data_networkx/MUTAG_target.dat. The other file needed is the set of subgraphs/features. For example, if you set the support of gSpan to 20 for MUTAG dataset, it will produce 39492 subgraphs. In principle, any set of subgraphs can be used with feature selection techniques, but usually is used when the set of subgraphs is very large. Once you have the following structure in one of the playgrounds:
gspan/
mutag_playground/
MUTAG.dat # this is the set of graphs
MUTAG_target.dat # this is the target variable for each graph
MUTAG_20.dat # this is the set of subgraphs generated by gSpan with support 20
simply run the following command:
# python {dataset_name}_playground/selection.py -d {dataset_name} -f {set_of_features} -s {selection_method} -k {top_k_features}
# for example mutag1_playground/selection.py -d MUTAG -f MUTAG_20 -s xgb -k 20
python {dataset_name}_playground/selection.py -d {dataset_name} -f {set_of_features} -s {selection_method} -k {top_k_features}This command will generate two files with first being a tabular representation (dataframe) of flattened graphs named {dataset_name}_{set_of_features}_flatten_df.dat and the set of selected features {dataset_name}_{set_of_features}_{selection_method}_{top_k_features}.dat. The later file can than be used for preprocessing the graphs.
First, under pytorch/experiments/ create a folder following your dataset name, e.g. pytorch/experiments/MUTAG. Inside this folder add the following structure:
pytorch/
experiments/
MUTAG/
data/
features/
results/
config.json
The directories data and results will be automatically populated once the experiments are executed. Under the features add the file generated by gSpan that contains your features {dataset_support}.dat. For example:
pytorch/
experiments/
MUTAG/
data/
features/
MUTAG_90.dat
results/
config.json
The same way add features generated by GAIA.
The config.json contains configuration about the model hyperparameters and data preprocessing. For example, if we want to encode relational features from the file MUTAG_90.dat, with and without labels, the following configuration should be provided.
{
"model": {
"seed": 0,
"batch_size": 32,
"hidden_channel": 32,
"learning_rate": 0.001,
"layers": 4,
"scheduler": true,
"step_size": 50,
"decay": 0.9,
"dropout": 0.5,
"folds": 10,
"epochs": 351
},
"features": [
{
"name": "MUTAG_90",
"labelled": false,
"device": "cuda:0"
},
{
"name": "MUTAG_90",
"labelled": true,
"device": "cuda:1"
}
]
}Once the config.json is ready, first preprocess the data by running the following command within pytorch project:
#python preprocess.py -d {experiment}
python preprocess.py -d MUTAGThis command will preprocess the data and provide the new processed dataset under the pytorch/experiments/{dataset}/data/processed/
Now to train, simply run the following command within pytorch project:
#python preprocess.py -d {experiment}
python train.py -d MUTAGThis command will train based on the configurations given in the config.json file of an experiment. This will generate results based on the dataset name under the pytorch/experiments/{dataset}/results/{dataset}_{label}.json.