Link Prediction with Spark: Implementing Common Measures for Link Prediction in Graphs using Apache Spark and Maven
This is a Maven project.
This guide assumes you already have Apache Spark and a Java JDK installed on your Linux machine.
To avoid Java heap errors, the spark-submit commands below include the option --driver-memory 4g, which gives the Spark driver enough memory.
To compile the project's apps, open a terminal in the project's root directory.
Run cd ./adamic-adar/, cd ./common-neighbors/, or cd ./jaccard-coefficient/ to navigate into the desired app's directory. Then run:
./mvnw install
If you want to undo the previous command and remove the generated build artifacts, run:
./mvnw clean
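If you prefer a single step, Maven also lets you chain the two goals: ./mvnw clean install.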
After compiling an app, you can execute it by running the following command from inside its directory.
NOTE: Replace <input-file-directory> with the absolute path of your input file (the test commands further below show concrete examples). You can use the files inside the provided input directory.
NOTE: Replace <number-of-the-top-results-to-be-displayed> with a positive integer; e.g. 500 will produce the top-500 results.
For Common Neighbors:
spark-submit --class org.spark.CommonNeighbors --driver-memory 4g \
./target/commonneighbors-0.1.jar <input-file-directory> <number-of-the-top-results-to-be-displayed>
For Jaccard Coefficient:
spark-submit --class org.spark.JaccardCoefficient --driver-memory 4g \
./target/jaccardcoefficient-0.1.jar <input-file-directory> <number-of-the-top-results-to-be-displayed>
For Adamic/Adar:
spark-submit --class org.spark.AdamicAdar --driver-memory 4g \
./target/adamicadar-0.1.jar <input-file-directory> <number-of-the-top-results-to-be-displayed>
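Each app is named after a standard link-prediction measure. For a candidate pair of nodes u and v with neighbor sets N(u) and N(v), the standard definitions are as follows (the exact scoring inside each app, e.g. the logarithm base used for Adamic/Adar, is not documented here):
- Common Neighbors: score(u, v) = |N(u) ∩ N(v)|
- Jaccard Coefficient: score(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
- Adamic/Adar: score(u, v) = sum over w in N(u) ∩ N(v) of 1 / log(|N(w)|)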
Each app takes as input a text file containing an undirected graph.
- Lines starting with '#' are considered comments.
- Every edge of the graph is described on one line by two integers (the node IDs) separated by a space.
- Nodes must be identified only by integers.
- The graph must be undirected, so if, for example, "1 2" is an edge of the graph, "2 1" must also be included.
Below is an example of a graph file that uses the structure described above.
# comment1
# comment2
2 1
1 3
4 2
3 4
5 3
4 5
1 2
3 1
2 4
4 3
3 5
5 4
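For illustration, here is a minimal Java/Spark sketch of how a file in this format could be loaded into per-node neighbor sets. It is only a sketch: the class and variable names are made up for the example and are not the classes used by this project's apps.

import java.util.HashSet;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LoadGraph {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LoadGraph");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // args[0] is the absolute path of the input file, as in the commands above.
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Drop comment and blank lines, then split each edge line into a
        // (node, neighbor) pair. Because the file lists both "u v" and "v u",
        // this directly yields the full neighborhood of every node.
        JavaPairRDD<Integer, Integer> edges = lines
                .filter(line -> !line.trim().isEmpty() && !line.startsWith("#"))
                .mapToPair(line -> {
                    String[] parts = line.trim().split("\\s+");
                    return new Tuple2<>(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
                });

        // Collect each node's neighbors into a set (the HashSet also removes duplicate edges).
        JavaPairRDD<Integer, Set<Integer>> neighbors = edges
                .groupByKey()
                .mapValues(vals -> {
                    Set<Integer> set = new HashSet<>();
                    vals.forEach(set::add);
                    return set;
                });

        System.out.println("Number of nodes: " + neighbors.count());
        sc.stop();
    }
}

Neighbor sets like these are a natural starting point for all three measures, since each of them only looks at N(u), N(v), and the degrees of their common neighbors.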
You can test the apps using the graph files provided in the input directory.
Inside that directory there are two graph files from snap.stanford.edu/data and a file called small_input containing the example above.
- ca-AstroPh is taken from https://snap.stanford.edu/data/ca-AstroPh.html
- ca-CondMat is taken from https://snap.stanford.edu/data/ca-CondMat.html
To test the apps with the ca-AstroPh dataset and display the top-5000 results, assuming you start in the project's root directory and have not compiled anything yet, run the following:
For Common Neighbors:
cd common-neighbors
./mvnw install
spark-submit --class org.spark.CommonNeighbors --driver-memory 4g ./target/commonneighbors-0.1.jar "$(realpath ../input/ca-AstroPh.txt)" 5000
For Jaccard Coefficient:
cd jaccard-coefficient
./mvnw install
spark-submit --class org.spark.JaccardCoefficient --driver-memory 4g ./target/jaccardcoefficient-0.1.jar "$(realpath ../input/ca-AstroPh.txt)" 5000
For Adamic/Adar:
cd adamic-adar
./mvnw install
spark-submit --class org.spark.AdamicAdar --driver-memory 4g ./target/adamicadar-0.1.jar "$(realpath ../input/ca-AstroPh.txt)" 5000
Enjoy the results!