- About
- Getting Started
- Installing, Testing, Building
- Running locally
- Deployment
- Continuous Ιntegration
- Todo
- Built With
- License
- Acknowledgments
Hybrid Girvan Newman. Code for the paper "A Distributed Hybrid Community Detection Methodology for Social Networks."
The proposed methodology is an iterative, divisive community detection process that combines the network topology features
of loose similarity and local edge betweenness measure, along with the user content information in order to remove the
inter-connection edges and thus unravel the subjacent community structure. Even if this iterative process might sound
computationally over-demanding, its application is certainly not prohibitive, since it can be safely concluded
from the experimentation results that the aforementioned measures are that well-informative and highly representative,
so merely few iterations are required to converge to the final community hierarchy at any case.
Implementation last tested with Python 3.6,
Apache Spark 2.4.5
and GraphFrames 0.8.0
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
You need to have a machine with Python = 3.6, Apache Spark = 2.4.5, GraphFrames = 0.8.0 and any Bash based shell (e.g. zsh) installed. For Apache Spark = 2.4.5 you will also need Java 8.
$ python3.6 -V
Python 3.6.9
echo $SHELL
/usr/bin/zsh
In order to run the main.py or the tests you will need to set the following environmental variables in your system (or in the spark.env file):
$ export SPARK_HOME="<Path to Spark Home>"
$ export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.0-spark2.4-s_2.11 pyspark-shell"
$ export JAVA_HOME="<Path to Java 8>"
$ cd $SPARK_HOME
/usr/local/spark
$ ./bin/pyspark --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
All the installation steps are being handled by the Makefile.
If you don't want to go through the setup steps and finish the installation and run the tests, execute the following command:
$ make install server=local
If you executed the previous command, you can skip through to the Running locally section.
$ make help
-----------------------------------------------------------------------------------------------------------
DISPLAYING HELP
-----------------------------------------------------------------------------------------------------------
make delete_venv
Delete the current venv
make create_venv
Create a new venv for the specified python version
make requirements
Upgrade pip and install the requirements
make run_tests
Run all the tests from the specified folder
make setup
Call setup.py install
make clean_pyc
Clean all the pyc files
make clean_build
Clean all the build folders
make clean
Call delete_venv clean_pyc clean_build
make install
Call clean create_venv requirements run_tests setup
make help
Display this message
-----------------------------------------------------------------------------------------------------------
$ make clean server=local
make delete_venv
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Deleting venv..
rm -rf venv
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
make clean_pyc
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Cleaning pyc files..
find . -name '*.pyc' -delete
find . -name '*.pyo' -delete
find . -name '*~' -delete
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
make clean_build
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Cleaning build directories..
rm --force --recursive build/
rm --force --recursive dist/
rm --force --recursive *.egg-info
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
$ make create_venv server=local
Creating venv..
python3.6 -m venv ./venv
$ make requirements server=local
Upgrading pip..
venv/bin/pip install --upgrade pip wheel setuptools
Collecting pip
.................
The tests are located in the tests
folder. To run all of them, execute the following command:
$ make run_tests server=local
source venv/bin/activate && \
.................
To build the project locally using the setup.py command, execute the following command:
$ make setup server=local
venv/bin/python setup.py install '--local'
running install
.................
In order to run the code now, you should place under the data/input_graphs the graph you
want the communities to be identified from.
You will also only need to create a yml file for any new graph before executing the main.py.
There two already configured yml files: confs/quakers.yml and confs/hamsterster.yml with the following structure:
tag: dev # Required
spark:
- config: # The spark settings
spark.master: local[*] # Required
spark.submit.deployMode: client # Required
spark_warehouse_folder: data/spark-warehouse # Required
spark.ui.port: 4040
spark.driver.cores: 5
spark.driver.memory: 8g
spark.driver.memoryOverhead: 4096
spark.driver.maxResultSize: 0
spark.executor.instances: 2
spark.executor.cores: 3
spark.executor.memory: 4g
spark.executor.memoryOverhead: 4096
spark.sql.broadcastTimeout: 3600
spark.sql.autoBroadcastJoinThreshold: -1
spark.sql.shuffle.partitions: 4
spark.default.parallelism: 4
spark.network.timeout: 3600s
dirs:
df_data_folder: data/dataframes # Folder to store the DataFrames as parquets
spark_warehouse_folder: data/spark-warehouse
checkpoints_folder: data/checkpoints
communities_csv_folder: data/csv_data # Folder to save the computed communities as csvs
input:
- config: # All properties required
name: Quakers
nodes:
path: data/input_graphs/Quakers/quakers_nodelist.csv2 # Path to the nodes file
has_header: true # Whether they have a header with the attribute names
delimiter: ','
encoding: ISO-8859-1
feature_names: # You can rename the attribute names (the number should be the same as the original)
- id
- Historical_Significance
- Gender
- Birthdate
- Deathdate
- internal_id
edges:
path: data/input_graphs/Quakers/quakers_edgelist.csv2 # Path to the edges file
has_header: true # Whether they have a header with the source and dest
has_weights: false # Whether they have a weight column
delimiter: ','
type: local
run_options: # All properties required
- config:
cached_init_step: false # Whether the cosine similarities and edge_betweenness been already been computed
# See the paper for info regarding the following attributes
feature_min_avg: 0.33
r_lvl1_thres: 0.50
r_lvl2_thres: 0.85
max_edge_weight: 0.50
betweenness_thres: 10
max_sp_length: 2
min_comp_size: 2
max_steps: 30 # Max steps for the algorithm to run if it doesn't converge
features_to_check: # Which attributes to take into consideration for the cosine similarities
- id
- Gender
output: # All properties required
- config:
logs_folder: data/logs
save_communities_to_csvs: false # Whether to save the computed communities in csvs or not
visualizer:
dimensions: 3 # Dimensions of the scatter plot (2 or 3)
save_img: true
folder: data/plots
steps: # The steps to plot
- 0 # The step before entering the main loop
- -1 # The Last step
The !ENV
flag indicates that a environmental value follows. For example you can set: logs_folder: !ENV ${LOGS_FOLDER}
You can change the values/environmental var names as you wish.
If a yaml variable name is changed/added/deleted, the corresponding changes should be reflected
on the Configuration class and the yml_schema.json too.
First, make sure you are in the created virtual environment:
$ source venv/bin/activate
(venv)
OneDrive/Projects/HGN dev
$ which python
/home/drkostas/Projects/HGN/venv/bin/python
(venv)
Now, in order to run the code you can either call the main.py
directly, or the HGN
console script.
$ python main.py -h
usage: main.py -c CONFIG_FILE [-d] [-h]
A Distributed Hybrid Community Detection Methodology for Social Networks.
Required Arguments:
-c CONFIG_FILE, --config-file CONFIG_FILE
The configuration yml file
Optional Arguments:
-d, --debug Enables the debug log messages
-h, --help Show this help message and exit
# Or
$ hgn --help
usage: hgn -c CONFIG_FILE [-d] [-h]
A Distributed Hybrid Community Detection Methodology for Social Networks.
Required Arguments:
-c CONFIG_FILE, --config-file CONFIG_FILE
The configuration yml file
Optional Arguments:
-d, --debug Enables the debug log messages
-h, --help Show this help message and exit
It is recommended that you deploy the application to a Spark Cluster.
Please see:
- Spark Cluster Overview [Apache Spark Docs]
- Apache Spark on Multi Node Cluster [Medium]
- Databricks Cluster
- Flintrock [Cheap & Easy EC2 Cluster]
For the continuous integration, the CircleCI service is being used. For more information you can check the setup guide.
Again, you should set the above-mentioned environmental variables (reference) and for any modifications, edit the circleci config.
Read the TODO to see the current task list.
- Apache Spark 2.4.5 - Fast and general-purpose cluster computing system
- GraphFrames 0.8.0 - A package for Apache Spark which provides DataFrame-based Graphs.
- CircleCI - Continuous Integration service
This project is licensed under the GNU License - see the LICENSE file for details.