forked from galanisl/LinkPrediction
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
154 lines (119 loc) · 5.15 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
title: "LinkPrediction"
author: "Gregorio Alanis-Lobato"
date: false
output: rmarkdown::github_document
---
Implementation of different classes of link predictors and methods to assess and
visualise their performance.
# Introduction
Complex systems can be represented by graphs $G(V, E)$, where $E$ is the set of
interactions (edges) between a set $V$ of system components (nodes). Link
predictors assign likelihood scores of interaction to all node pairs that are
disconnected in the observable network topology. These scores can lead to the
prediction of future friendships in a social network or future direct flights
between cities in an airport transportation network.
To assess the performance of link predictors, one normally removes an increasing
number of edges $E^R \subset E$ from the network of interest, applies a link
prediction method to the pruned network and evaluates its performance with one
of the following metrics:
- `recall_at_k`: Reports of the fraction of candidate links with the $|E^R|$
highest likelihood scores that are in $E^R$ (Recall at $k$, where $k = |E^R|$).
- `aupr`: The area under the Precision-Recall curve.
- `auroc`: The area under the Receiving Operating Characteristic curve.
- `avg_prec`: The average Precision at the point where Recall reaches its
maximum value of 1.
`LinkPrediction` implements the following classes of link prediction techniques:
- Neighbourhood-based predictors, which assign high likelihood scores to node
pairs that share many neighbours (`lp_cn`, `lp_pa`, `lp_jc`, `lp_dice`, `lp_aa`,
`lp_ra` and `lp_l3`).
- CAR-based predictors, which assign high likelihood scores to node pairs that
share many neighbours that also interact with themselves (`lp_car`, `lp_cpa`,
`lp_caa` and `lp_cra`).
- Embedding-based methods rank non-adjacent nodes through distances on a network
projection in two-dimensional Euclidean space (`lp_isomap`, `lp_leig`, and
`lp_mce`).
- The Hierarchical Random Graph method, which searches the space of all possible
dendrograms of a network for the ones that best fit its hierarchical structure.
Non-adjacent node pairs that have high average probability of being connected
within these dendrograms are considered good candidates for interaction
(`lp_hrg`).
- The Structural Perturbation Method, which is based on the hypothesis that
links are predictable if removing them has only small effects on network
structure (`lp_spm`).
In addition, `LinkPrediction` includes function `lp_matrix` and `get_non_edges`,
which are key to extend the package with other link prediction approaches (more
on this below).
Finally, the performance of these link predictors on a network of interest can
be assessed and visualised with functions `prune_recover` (which computes all
the above-mentioned evaluation metrics) and `plot_lp_performance`.
For more information on the link prediction problem, see:
- Martínez, V., Berzal, F. & Cubero, J.-C. A survey of link prediction in
complex networks. *ACM Computing Surveys* **49**, 1-33 (2016).
- Lü, L., Pan, L., Zhou, T., Zhang, Y.-C. & Stanley, H. E. Toward link
predictability of complex networks. *PNAS* **112**, 2325-2330 (2015).
# Installation
1. Install the `devtools` package from CRAN if you haven't done so:
```r
install.packages("devtools")
```
2. Load the `devtools` package:
```r
library("devtools")
```
3. Install `LinkPrediction` using the `install_github` function:
```r
install_github("galanisl/LinkPrediction")
```
# Usage
```{r setup, include=FALSE}
library(dplyr)
```
To start using `LinkPrediction`, load the package:
```{r message=FALSE, warning=FALSE}
library("LinkPrediction")
```
`LinkPrediction` includes two complex network datasets (type `?karate_club` and
`?jazz_collab` in R for more information). Let's use the Jazz Collaboration
network to illustrate the use of the package.
First, let's apply the Common Neighbours link predictor to the network:
```{r message=FALSE, warning=FALSE}
(cn <- lp_cn(jazz_collab))
```
Note how the lp_* functions return a `tibble` of candidate links with their
likelihoods of interaction. Let's now compute the structural consistency
$\sigma_c \in [0, 1]$ of this network. The higher it is, the higher its link
predictability:
```{r}
sigma_c <- structural_consistency(jazz_collab)
mean(sigma_c)
sd(sigma_c)
```
Since the $\sigma_c$ is high, link predictors are likely to give us good
candidates of interaction.
Let's now implement a link predictor of our own with the help of function
`get_non_edges`. Our method will return a tibble with a random ordering of
the disconnected node pairs in the network:
```{r}
lp_rnd <- function(g){
non_edges <- get_non_edges(g)
prediction <- tibble(nodeA = non_edges[, 1], nodeB = non_edges[, 2],
scr = sample(nrow(non_edges))) %>%
arrange(desc(scr))
return(prediction)
}
```
Finally, let's compare our method with other link predictors and visualise the
results:
```{r}
assessment <- prune_recover(jazz_collab,
"lp_rnd",
"lp_cn",
"lp_car",
"lp_spm")
plot_lp_performance(assessment, err = "sd")
```
# Session information
```{r}
sessionInfo()
```