Machine Learning for Health Informatics, Assignment 3:

Social network anonymization with SaNGreeA

Name change: This is now officially assignment 3, as Marcus' last one was already assignment 2.

Please read the original SaNGreeA paper contained in the 'paper' folder and / or watch the following YouTube movie before starting the assignment.

K-Anonymity in Social Networks

SaNGreeA is a Social Network Greedy Anonymization algorithm based on the concept of greedy clustering. Its input is given in the form of a graph structure; for this assignment, this graph is encoded as a list of feature vectors plus an adjacency list.

In order to compute its clusters, SaNGreeA takes into account two different distance measures (cost functions) from clusters to nodes / between 2 nodes:

The Generalization Information Loss (GIL), which measures the degree to which the features of a cluster would have to be generalized in order to incorporate a new node.
The Structural Information Loss representing the cost of edge generalization as the error probability in guessing the original network structure from the anonymized one.

Input data

For this assignment we are using

a subset of the adult dataset as feature vector input for our GIL computations
- please have a look at the respective CSV file
a pre-generated random graph structure in the form of an adjacency list for our SIL computations.
- please have a look at the respective CSV file
generalization hierarchies in the form of json files
- which can be found in the gen hierarchy folder

Output data

The structure of a sample output file can be seen here and should be pretty intuitive - '*' stands for maximum generalization (and therefore loss), all other entries can be looked up in the respective generalization hierarchy json files.

High-level algorithm walkthrough

The following steps would be taken given only the raw adults.csv input file (for this assignment, I have already prepared the first 4 steps in totality, only 3 functions have to be filled in, as we will see in the next section):

Read in the CSV, preprocessing / omitting erroneous entries
Generate a (random) graph structure for simulating a social network out of the resulting data
Instantiating generalization hierarchies for categorical values ('Italy' => 'Western Europe') as well as range values ([38-50] + 35 => [35-50])
Iterating over the dataset (1.), we execute the following steps:
- Initialize a new cluster with the given node
- While the cluster size is smaller k:
  - choose the node from the dataset with the best (lowest) total cost and add it to the cluster
If the last cluster is smaller in size than k, we disperse it's members amongst the previously computed clusters (this is not necessary for our example)

Code structure

data/ - input data as well as sample output data as described above
output/ - folder to write results to
paper/ - contains the original SaNGreeA paper
src/
- io/
  - csvInput.py - file to read .csv files into python dicts
  - jsonInput.py - file to read .json files into python dicts
  - output.py - file to write the resulting cluster structure to a .csv file
catGenHierarchy.py - class to read .json based generalization hierarchy files and compute individual category generalization costs
rangeGenHierarchy.py - class to compute individual range generalization costs
globals.py - central settings file, giving the following options
- K_FACTOR - the minimum cluster size
- ALPHA - the weight applied to the GIL cost measure
- BETA - the weight applied to the SIL cost measure
- GEN_WEIGHT_VECTORS - 3 different pre-settings for feature weights
- VECTOR - the one setting chosen for a given run
SaNGreeA.py - the main file: prepares input structures, runs the main loop, prints results
nodeCluster.py - Representing the anonymized clusters of size k, this class is responsible for computing the GIL / SIL cost functions in order to find the next best candidate node.

In order to run the code, simply execute the src/SaNGreeA.py file - if you need more orientation, there are still plenty of debug statements (sorry, I am not a python expert) currently commented out in the code...

Known implementation deviations from the original algorithm

(should not have drastic consequences for the purpose of the demo / assignment)

main loop starts with first node instead of node with max degree
no dispersion of last cluster

Your part

You work will consist of implementing 3 functions in the NodeCluster class:

computeRangeCost: Given this cluster and a candidate node, compute the cost for a potential expansion of the clusters current range (generalization, paper Definition 4, page 8)).
computeNewGeneralization: Given this cluster and a candidate node, compute the most specific possible generalization for the node to fit into the cluster (walk up the hierarchy, paper Definition 4, page 8):

cluster['sex'] = Male, candidate['sex'] = Female => result = 'all'

computeSIL: given the cluster and a candidate node, compute the structural neighborhood similarity using binary neighborhood vectors as given in the paper, Definition 11, page 12.

The difference between my reference implementation and the assignment version with method stubs is only a few dozen lines of code, so this should be fairly straightforward ;)

Submission

Please just fork the github repository, add your code, and send me a pull request when you're done.

Deadline

Friday, July 8th, 23:59 hours ;)

Good luck!

[Bernd Malle, June 15th, 2016]

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
output		output
paper		paper
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Machine Learning for Health Informatics, Assignment 3:

Social network anonymization with SaNGreeA

Input data

Output data

High-level algorithm walkthrough

Code structure

Known implementation deviations from the original algorithm

Your part

Submission

Deadline

Good luck!

About

Releases

Packages

Languages

License

cassinius/mlhi-ass2-anonymization

Folders and files

Latest commit

History

Repository files navigation

Machine Learning for Health Informatics, Assignment 3:

Social network anonymization with SaNGreeA

Input data

Output data

High-level algorithm walkthrough

Code structure

Known implementation deviations from the original algorithm

Your part

Submission

Deadline

Good luck!

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages