-
-
Notifications
You must be signed in to change notification settings - Fork 3
/
part-01-09-netdiffuser.qmd
622 lines (447 loc) · 18.8 KB
/
part-01-09-netdiffuser.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
---
date-modified: 2024-05-10
---
# Network diffusion
This chapter is based on the 2018 and 2019 tutorials of netdiffuseR
at the Sunbelt conference. The source code of the tutorials, taught by [Thomas W. Valente](https://keck.usc.edu/faculty-search/thomas-william-valente/)
and [George G. Vega Yon](https://ggvy.cl) (author of this book), is available [here](https://github.com/USCCANA/netdiffuser-sunbelt2018/tree/sunbelt2019).
## Network diffusion of innovation
### Diffusion networks
```{r setup-netdiffuseR, echo=FALSE, message=FALSE, warning=FALSE}
library(netdiffuseR)
```
<img source="https://github.com/USCCANA/netdiffuser-sunbelt2018/blob/sunbelt2019/valente_1995.jpg?raw=true" style="with: 200px"></img>
* Explains how new ideas and practices (innovations) spread within and between
communities.
* While a lot of factors have been shown to influence diffusion (Spatial,
Economic, Cultural, Biological, etc.), Social Networks are prominent.
* There are many components in the diffusion network model, including network exposures, thresholds, infectiousness, susceptibility, hazard rates, diffusion rates (bass model), clustering (Moran's I), and so on.
### Thresholds
* One of the canonical concepts is the network threshold. Network thresholds (Valente, 1995; 1996), $\tau$, are defined as the required proportion or number of neighbors that lead you to adopt a particular behavior (innovation), $a=1$. In (very) general terms
$$
a_i = \left\{\begin{array}{ll}
1 &\text{if } \tau_i\leq E_i \\
0 & \text{Otherwise}
\end{array}\right. \qquad
E_i \equiv \frac{\sum_{j\neq i}\mathbf{X}_{ij}a_j}{\sum_{j\neq i}\mathbf{X}_{ij}}
$$
Where $E_i$ is i's exposure to the innovation and $\mathbf{X}$ is the adjacency matrix (the network).
* This can be generalized and extended to include covariates and other network weighting schemes (that's what __netdiffuseR__ is all about).
## The netdiffuseR R package
### Overview
__netdiffuseR__ is an R package that:
* It is designed to Visualize, Analyze, and simulate network diffusion data (in general).
* Depends on some pretty popular packages:
* _RcppArmadillo_: So it's fast,
* _Matrix_: So it's big,
* _statnet_ and _igraph_: So it's not from scratch
* Can handle big graphs, e.g., an adjacency matrix with more than 4 billion elements (PR for RcppArmadillo)
* Already on CRAN with ~6,000 downloads since its first version, Feb 2016,
* A lot of features to make it easy to read network (dynamic) data, making it a companion of other net packages.
### Datasets
- __netdiffuseR__ has the three classic Diffusion Network Datasets:
- `medInnovationsDiffNet` Doctors and the innovation of Tetracycline (1955).
- `brfarmersDiffNet` Brazilian farmers and the innovation of Hybrid Corn Seed (1966).
- `kfamilyDiffNet` Korean women and Family Planning methods (1973).
```{r}
#| label: printing
brfarmersDiffNet
medInnovationsDiffNet
kfamilyDiffNet
```
### Visualization methods
```{r}
#| label: viz
#| cache: true
set.seed(12315)
x <- rdiffnet(
400, t = 6, rgraph.args = list(k=6, p=.3),
seed.graph = "small-world",
seed.nodes = "central", rewire = FALSE, threshold.dist = 1/4
)
plot(x)
plot_diffnet(x)
plot_diffnet2(x)
plot_adopters(x)
plot_threshold(x)
plot_infectsuscep(x, K=2)
plot_hazard(x)
```
### Problems
1. Using the diffnet object in [`intro.rda`](intro.rda), use the function `plot_threshold` specifying shapes and colors according to the variables ItrustMyFriends and Age. Do you see any pattern?
## Simulation of diffusion processes
```{r setup-sim, echo=FALSE, message=FALSE, warning=FALSE}
library(netdiffuseR)
```
Before we start, a review of the concepts we will be using here
1. Exposure: Proportion/number of neighbors that have adopted an innovation at each point in time.
2. Threshold: The proportion/number of your neighbors who had adopted at or one time period before ego (the focal individual) adopted.
3. Infectiousness: How much $i$'s adoption affects her alters.
4. Susceptibility: How much $i$'s alters' adoption affects her.
5. Structural equivalence: How similar is $i$ to $j$ in terms of position in the network.
### Simulating diffusion networks
We will simulate a diffusion network with the following parameters:
1. Will have 1,000 vertices,
2. Will span 20 time periods,
3. The initial adopters (seeds) will be selected at random,
4. Seeds will be a 10\% of the network,
5. The graph (network) will be small-world,
6. Will use the WS algorithm with $p=.2$ (probability of rewiring).
7. Threshold levels will be uniformly distributed between [0.3, 0.7\]
To generate this diffusion network, we can use the `rdiffnet` function included in the package:
```{r Generating the random graph}
# Setting the seed for the RNG
set.seed(1213)
# Generating a random diffusion network
net <- rdiffnet(
n = 1e3, # 1.
t = 20, # 2.
seed.nodes = "random", # 3.
seed.p.adopt = .1, # 4.
seed.graph = "small-world", # 5.
rgraph.args = list(p=.2), # 6.
threshold.dist = function(x) runif(1, .3, .7) # 7.
)
```
* The function `rdiffnet` generates random diffusion networks. Main features:
1. Simulating random graph or using your own,
2. Setting threshold levels per node,
3. Network rewiring throughout the simulation, and
4. Setting the seed nodes.
* The simulation algorithm is as follows:
1. If required, a baseline graph is created,
2. Set of initial adopters and threshold distribution are established,
3. The set of t networks is created (if required), and
4. Simulation starts at t=2, assigning adopters based on exposures and thresholds:
a. For each $i \in N$, if its exposure at $t-1$ is greater than its threshold, then adopts, otherwise, continue without change.
b. next $i$
### Rumor spreading
```{r sim-rumor}
library(netdiffuseR)
set.seed(09)
diffnet_rumor <- rdiffnet(
n = 5e2,
t = 5,
seed.graph = "small-world",
rgraph.args = list(k = 4, p = .3),
seed.nodes = "random",
seed.p.adopt = .05,
rewire = TRUE,
threshold.dist = function(i) 1L,
exposure.args = list(normalized = FALSE)
)
```
```{r summary-rumor}
summary(diffnet_rumor)
```
```{r plot-rumor, fig.align='center', cache=TRUE}
plot_diffnet(diffnet_rumor, slices = c(1, 3, 5))
# We want to use igraph to compute layout
igdf <- diffnet_to_igraph(diffnet_rumor, slices=c(1,2))[[1]]
pos <- igraph::layout_with_drl(igdf)
plot_diffnet2(diffnet_rumor, vertex.size = dgr(diffnet_rumor)[,1], layout=pos)
```
### Difussion
```{r sim-complex}
set.seed(09)
diffnet_complex <- rdiffnet(
seed.graph = diffnet_rumor$graph,
seed.nodes = which(diffnet_rumor$toa == 1),
rewire = FALSE,
threshold.dist = function(i) rbeta(1, 3, 10),
name = "Diffusion",
behavior = "Some social behavior"
)
```
```{r plot-complex-and-disease}
plot_adopters(diffnet_rumor, what = "cumadopt", include.legend = FALSE)
plot_adopters(diffnet_complex, bg="tomato", add=TRUE, what = "cumadopt")
legend("topleft", legend = c("Disease", "Complex"), col = c("lightblue", "tomato"),
bty = "n", pch=19)
```
### Mentor Matching
```{r mentor-match, cache = TRUE}
# Finding mentors
mentors <- mentor_matching(diffnet_rumor, 25, lead.ties.method = "random")
# Simulating diffusion with these mentors
set.seed(09)
diffnet_mentored <- rdiffnet(
seed.graph = diffnet_complex,
seed.nodes = which(mentors$`1`$isleader),
rewire = FALSE,
threshold.dist = diffnet_complex[["real_threshold"]],
name = "Diffusion using Mentors"
)
summary(diffnet_mentored)
```
```{r toa_mat-mentors}
cumulative_adopt_count(diffnet_complex)
cumulative_adopt_count(diffnet_mentored)
```
### Example by changing threshold
```{r sim-sim, cache = TRUE, collapse = TRUE}
# Simulating a scale-free homophilic network
set.seed(1231)
X <- rep(c(1,1,1,1,1,0,0,0,0,0), 50)
net <- rgraph_ba(t = 499, m=4, eta = X)
# Taking a look in igraph
ig <- igraph::graph_from_adjacency_matrix(net)
plot(ig, vertex.color = c("azure", "tomato")[X+1], vertex.label = NA,
vertex.size = sqrt(dgr(net)))
# Now, simulating a bunch of diffusion processes
nsim <- 500L
ans_1and2 <- vector("list", nsim)
set.seed(223)
for (i in 1:nsim) {
# We just want the cum adopt count
ans_1and2[[i]] <-
cumulative_adopt_count(
rdiffnet(
seed.graph = net,
t = 10,
threshold.dist = sample(1:2, 500L, TRUE),
seed.nodes = "random",
seed.p.adopt = .10,
exposure.args = list(outgoing = FALSE, normalized = FALSE),
rewire = FALSE
)
)
# Are we there yet?
if (!(i %% 50))
message("Simulation ", i," of ", nsim, " done.")
}
# Extracting prop
ans_1and2 <- do.call(rbind, lapply(ans_1and2, "[", i="prop", j=))
ans_2and3 <- vector("list", nsim)
set.seed(223)
for (i in 1:nsim) {
# We just want the cum adopt count
ans_2and3[[i]] <-
cumulative_adopt_count(
rdiffnet(
seed.graph = net,
t = 10,
threshold.dist = sample(2:3, 500L, TRUE),
seed.nodes = "random",
seed.p.adopt = .10,
exposure.args = list(outgoing = FALSE, normalized = FALSE),
rewire = FALSE
)
)
# Are we there yet?
if (!(i %% 50))
message("Simulation ", i," of ", nsim, " done.")
}
ans_2and3 <- do.call(rbind, lapply(ans_2and3, "[", i="prop", j=))
```
We can simplify by using the function `rdiffnet_multiple`. The following lines of code accomplish the same as the previous code avoiding the for-loop (from the user's perspective). Besides of the usual parameters passed to `rdiffnet`, the `rdiffnet_multiple` function requires `R` (number of repetitions/simulations), and `statistic` (a function that returns the statistic of interest). Optionally, the user may choose to specify the number of clusters to run it in parallel (multiple CPUs):
```{r rdiffnet-multiple}
ans_1and3 <- rdiffnet_multiple(
# Num of sim
R = nsim,
# Statistic
statistic = function(d) cumulative_adopt_count(d)["prop",],
seed.graph = net,
t = 10,
threshold.dist = sample(1:3, 500, TRUE),
seed.nodes = "random",
seed.p.adopt = .1,
rewire = FALSE,
exposure.args = list(outgoing=FALSE, normalized=FALSE),
# Running on 4 cores
ncpus = 4L
)
```
```{r sim-sim-results}
boxplot(ans_1and2, col="ivory", xlab = "Time", ylab = "Threshold")
boxplot(ans_2and3, col="tomato", add=TRUE)
boxplot(t(ans_1and3), col = "steelblue", add=TRUE)
legend(
"topleft",
fill = c("ivory", "tomato", "steelblue"),
legend = c("1/2", "2/3", "1/3"),
title = "Threshold range",
bty ="n"
)
```
<!-- * Example simulating a thousand networks by changing threshold levels.
The final prevalence, or hazard as a function of threshold levels. -->
### Problems
1. Given the following types of networks: Small-world, Scale-free, Bernoulli, what set of $n$ initiators maximizes diffusion?
<!-- (<a href="sim-solutions.r" target="_blank">solution script</a> and <a href="sim-solutions.png" target="_blank">solution plot</a>) -->
## Statistical inference
### Moran's I
* Moran's I tests for spatial autocorrelation.
* __netdiffuseR__ implements the test in `moran`, which is suited for sparse matrices.
* We can use Moran's I as a first look to whether there is something happening: let that be influence or homophily.
### Using geodesics
* One approach is to use the geodesic (shortest path length) matrix to account for indirect influence.
* __netdiffuseR__ has a function to do so, the `approx_geodesic` function, which, using graph powers, computes the shortest path up to `n` steps. This could be faster (if you only care up to `n` steps) than `igraph` or `sns`:
```{r geodesic_speed, cache=TRUE}
# Extracting the large adjacency matrix (stacked)
dgc <- diag_expand(medInnovationsDiffNet$graph)
ig <- igraph::graph_from_adjacency_matrix(dgc)
mat <- network::as.network(as.matrix(dgc))
# Measuring times
times <- microbenchmark::microbenchmark(
netdiffuseR = netdiffuseR::approx_geodesic(dgc),
igraph = igraph::distances(ig),
sna = sna::geodist(mat),
times = 50, unit="ms"
)
```
```{r geodesic_speed-box, autodep=TRUE, echo=FALSE}
microbenchmark:::boxplot.microbenchmark(times)
```
* The `summary.diffnet` method already runs Moran's for you. What happens under the hood is:
```{r}
# For each time point we compute the geodesic distances matrix
W <- approx_geodesic(medInnovationsDiffNet$graph[[1]])
# We get the element-wise inverse
W@x <- 1/W@x
# And then compute moran
moran(medInnovationsDiffNet$cumadopt[,1], W)
```
### Structural dependence and permutation tests
- A novel statistical method (work-in-progress) for testing network influence effects.
- Included in the package, tests whether a particular network statistic depends on network structure
- Suitable to be applied to network thresholds (you can't use thresholds in regression-like models!)
### Idea
- Let $\mathcal{G} = (V,E)$ be a graph, $\gamma$ a vertex attribute, and $\beta = f(\gamma,\mathcal{G})$, then
$$\gamma \perp \mathcal{G} \implies \mathbb{E}\left[\beta(\gamma,\mathcal{G})|\mathcal{G}\right] = \mathbb{E}\left[\beta(\gamma,\mathcal{G})\right]$$
- For example, if time of adoption is independent of the structure of the network, then the average threshold level will be independent from the network structure as well.
- Another way of looking at this is that the test will allow us to see how probable it is to have this combination of network structure and network threshold (if it is uncommon, then we say that the diffusion model is highly likely)
#### Example Not random TOA
- To use this test, __netdiffuseR__ has the `struct_test` function.
- It simulates networks with the same density, and computes a particular statistic every time, generating an EDF (Empirical Distribution Function) under the Null hypothesis (p-values).
```{r Struct non-random-toa, cache=TRUE}
# Simulating network
set.seed(1123)
net <- rdiffnet(n=500, t=10, seed.graph = "small-world")
# Running the test
test <- struct_test(
graph = net,
statistic = function(x) mean(threshold(x), na.rm = TRUE),
R = 1e3,
ncpus=4, parallel="multicore"
)
# See the output
test
```
```{r, echo=FALSE}
hist(test)
```
- Now we shuffle times of adoption, so that is random
```{r random-toa, cache=TRUE}
# Resetting TOAs (now will be completely random)
diffnet.toa(net) <- sample(diffnet.toa(net), nnodes(net), TRUE)
# Running the test
test <- struct_test(
graph = net,
statistic = function(x) mean(threshold(x), na.rm = TRUE),
R = 1e3,
ncpus=4, parallel="multicore"
)
# See the output
test
```
```{r, echo=FALSE}
hist(test)
```
### Regression analysis
* In regression analysis, we want to see if exposure, once we control for other covariates, had any effect on adopting a behavior.
* The big problem is when we have a latent variable that co-determines both network and behavior.
* Regression analysis will be generically biased Unless we can control for that variable.
* On the other hand, if you can claim that either such variable doesn't exist or you actually can control for it, then we have two options: lagged exposure models or contemporaneous exposure models. We will focus on the former.
#### Lagged exposure models
* In this type of model, we usually have the following
$$
y_t = f(W_{t-1}, y_{t-1}, X_i) + \varepsilon
$$
Furthermore, in the case of adoption, we have
$$
y_{it} = \left\{
\begin{array}{ll}
1 & \text{if}\quad \rho\sum_{j\neq i}\frac{W_{ijt-1}y_{it-1}}{\sum_{j\neq i}W_{ijt-1}} + X_{it}\beta > 0\\
0 & \text{otherwise}
\end{array}
\right.
$$
* In netdiffuseR, it is as easy as doing the following:
```{r dataforreg}
# fakedata
set.seed(121)
W <- rgraph_ws(1e3, 8, .2)
X <- cbind(var1 = rnorm(1e3))
toa <- sample(c(NA,1:5), 1e3, TRUE)
dn <- new_diffnet(W, toa=toa, vertex.static.attrs = X)
# Computing exposure and adoption for regression
dn[["cohesive_expo"]] <- cbind(NA, exposure(dn)[,-nslices(dn)])
dn[["adopt"]] <- dn$cumadopt
# Generating the data and running the model
dat <- as.data.frame(dn)
ans <- glm(adopt ~ cohesive_expo + var1 + factor(per),
data = dat,
family = binomial(link="probit"),
subset = is.na(toa) | (per <= toa))
summary(ans)
```
Alternatively, we could have used the new function `diffreg`
```{r}
ans <- diffreg(dn ~ exposure + var1 + factor(per), type = "probit")
summary(ans)
```
#### Contemporaneous exposure models
* Similar to the lagged exposure models, we usually have the following
$$
y_t = f(W_t, y_t, X_t) + \varepsilon
$$
Furthermore, in the case of adoption, we have
$$
y_{it} = \left\{
\begin{array}{ll}
1 & \text{if}\quad \rho\sum_{j\neq i}\frac{W_{ijt}y_{it}}{\sum_{j\neq i}W_{ijt}} + X_{it}\beta > 0\\
0 & \text{otherwise}
\end{array}
\right.
$$
* Unfortunately, since $y_t$ is in both sides of the equation, these models cannot be fitted using a standard probit or logit regression.
* Two alternatives to solve this:
a. Using Instrumental Variables Probit (ivprobit in both R and Stata)
b. Use a Spatial Autoregressive (SAR) Probit (SpatialProbit and ProbitSpatial in R).
* We won't cover these here.
### Problems
Using the dataset [stats.rda](stats.rda):
1. Compute Moran's I as the function `summary.diffnet` does. To do so, you'll need to use the function `toa_mat` (which calculates the cumulative adoption matrix), and `approx_geodesic` (which computes the geodesic matrix). (see `?summary.diffnet` for more details).
2. Read the data as diffnet object, and fit the following logit model $adopt = Exposure*\gamma + Measure*\beta + \varepsilon$. What happens if you exclude the time-fixed effects?
<!-- (<a href="stats-solutions.r" target="_blank">solution script</a>) -->
<!-- ```{r datagen, echo=FALSE, cache=TRUE}
set.seed(1)
n <- 500
nper <- 5
X <- cbind(Measure=rnorm(n))
y <- cbind(sample(c(0, 1), n, TRUE, prob = c(.9, .1)))
# Baseline network
W <- (rgraph_ws(n, k=8, p = .2))
sim_space <- function(W, y, X, pers = 4, lag = FALSE, rho = .2, beta=.5) {
W <- as.matrix(W)
W <- W/(rowSums(W) + 1e-20)
n <- nrow(W)
for (i in 1:pers) {
if (!lag)
ynew <- (solve(diag(n) - rho*W) %*% (X*beta) + rnorm(n)) > 0
else
ynew <- (rho * (W %*% y[,i - as.integer(i != 1),drop=FALSE]) + beta*X + rnorm(n)) > 0
y <- cbind(y, ifelse(
y[,i - as.integer(i != 1),drop=FALSE] == 1,
y[,i - as.integer(i != 1),drop=FALSE],
ynew)
)
}
y
}
ans <- sim_space(W, y, X, pers = nper, lag=TRUE)
toa <- ncol(ans) - apply(ans, 1, sum)
X <- cbind(X, toa=ifelse(toa == 0, NA, toa))
save(X, W, file="stats.rda")
``` -->