- Clone this repo

```
git clone https://github.com/nddsg/HDTM.git
```

- Configure the GraphLab project

```
cd HDTM/src/graphlab/
./configure
```

- Compile HDTM

```
cd ./release/apps
make -j2
```

`hdtm` accepts the following options:

* `--vertices`: file containing the article vertices
* `--edges`: file containing the article edges
* `--root`: id of the root article
* `--gamma`: gamma for the RWR calculation, default 0.25
* `--alpha`: alpha, default 10
* `--eta`: eta, default 0.1
* `--iter`: number of iterations, default 5
* `--burn`: burn-in, default 100
* `--sample`: sample interval, default 10
* `--token`: number of tokens; run `wc -l` on the dictionary file to get it
* `--prefix`: prefix of the binary graph files
* `--load`: load the graph from binary files

Example:

```
./hdtm --vertices /data/bshi/wikipedia/vertices.txt --edges /data/bshi/wikipedia/edges.txt \
    --root 26700 --token 6064216 --iter 400 --burn 100 --sample 50 \
    --prefix /data/bshi/wikipedia/graphlab/result/
```

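For scripted runs, the command line can be assembled programmatically. The following is a minimal sketch, not part of HDTM itself: the helper names `count_lines` and `hdtm_command` are hypothetical, and treating `--vertices`, `--edges`, `--root`, `--token` and `--prefix` as the always-present arguments is an assumption based only on the example above.

```python
import shlex

# Optional numeric flags taken from the option list in this README.
OPTIONAL_FLAGS = {"gamma", "alpha", "eta", "iter", "burn", "sample"}


def count_lines(path):
    """Count newline characters in a file, which is what `wc -l` reports."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 20), b""))


def hdtm_command(vertices, edges, root, token, prefix, binary="./hdtm", **options):
    """Build the argument list for one hdtm run (hypothetical helper)."""
    unknown = set(options) - OPTIONAL_FLAGS
    if unknown:
        raise ValueError(f"unknown hdtm options: {sorted(unknown)}")
    args = [binary,
            "--vertices", vertices,
            "--edges", edges,
            "--root", str(root),
            "--token", str(token),
            "--prefix", prefix]
    for name, value in options.items():
        args += [f"--{name}", str(value)]
    return args


if __name__ == "__main__":
    cmd = hdtm_command("vertices.txt", "edges.txt", root=26700, token=6064216,
                       prefix="result/", iter=400, burn=100, sample=50)
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

The value for `token` can come from `count_lines` applied to your dictionary file; the file paths in the usage lines are placeholders.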
HDTM requires:

- a vertices file, where each line holds a vertex id, a tab, and a space-separated list of word ids

```
4	3 4 5 6 7
5	3 4 5 7 8
6	8 10 12 14 ....
```

- an edges file, where each line holds a vertex id, a tab, and a vertex id

```
4	5
4	6
```

Note that the algorithm requires the words of each article to be converted to integers, and the values 0 and 1 must not be used as word ids, since they are reserved for `empty_key` and `deleted_key`.
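
One way to satisfy the word id constraint is to number words starting from 2. The following is a minimal sketch assuming the articles are already tokenized and held in memory; `build_vocabulary` and `vertex_lines` are hypothetical helpers, not part of this repo.

```python
def build_vocabulary(articles, first_id=2):
    """Map each distinct word to an integer id, skipping the reserved ids 0 and 1."""
    vocab = {}
    for words in articles.values():
        for word in words:
            if word not in vocab:
                vocab[word] = first_id + len(vocab)
    return vocab


def vertex_lines(articles, vocab):
    """Yield one line per article: vertex id, a tab, then space-separated word ids."""
    for vertex_id, words in articles.items():
        ids = " ".join(str(vocab[w]) for w in words)
        yield f"{vertex_id}\t{ids}"


if __name__ == "__main__":
    # Toy input: vertex id -> list of tokens (made-up data for illustration).
    articles = {4: ["graph", "topic", "model"], 5: ["topic", "tree"]}
    vocab = build_vocabulary(articles)
    for line in vertex_lines(articles, vocab):
        print(line)
```

Writing the yielded lines to a file gives a vertices file in the format described above.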

When `hdtm` finishes, you will get a set of result files that looks like this:

```
bshi@dsg1:/data/bshi/wikipedia/graphlab/result$ ls
0_0_0.bin    140_0_0.bin  190_0_0.bin  240_0_0.bin  290_0_0.bin  340_0_0.bin  390_-1.89962e+14_0.bin
100_0_0.bin  150_0_0.bin  200_0_0.bin  250_0_0.bin  300_0_0.bin  350_0_0.bin
110_0_0.bin  160_0_0.bin  210_0_0.bin  260_0_0.bin  310_0_0.bin  360_0_0.bin
120_0_0.bin  170_0_0.bin  220_0_0.bin  270_0_0.bin  320_0_0.bin  370_0_0.bin
130_0_0.bin  180_0_0.bin  230_0_0.bin  280_0_0.bin  330_0_0.bin  380_0_0.bin
```

The first number in each file name is the iteration number.
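
Because the iteration number is the leading field, the files should be ordered numerically rather than alphabetically. A small sketch, with hypothetical helper names, that picks the latest snapshot from a directory listing:

```python
def iteration_of(filename):
    """Return the iteration number, i.e. the part of the name before the first underscore."""
    return int(filename.split("_", 1)[0])


def latest_snapshot(filenames):
    """Return the .bin file with the highest iteration number."""
    bins = [name for name in filenames if name.endswith(".bin")]
    return max(bins, key=iteration_of)


if __name__ == "__main__":
    names = ["0_0_0.bin", "100_0_0.bin", "380_0_0.bin", "390_-1.89962e+14_0.bin"]
    print(latest_snapshot(names))
```

In practice `filenames` could come from `os.listdir` on the result directory.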
### Convert `HDTM` binary graph to text file

```
./hdtm_ana BINARY_FILE.bin ORIGINAL_GRAPH_OUT HIERARCHY_OUT NODE_CHANGES_OUT
```

* `ORIGINAL_GRAPH_OUT` is a baseline tree generated from the input graph. If a node in the original input graph has more than one incoming edge, one edge is picked at random and the rest are discarded to construct a tree. The output file format is `src \t dst`.
* `HIERARCHY_OUT` is the resulting hierarchy. This is also a tree, but the parents are selected based on HDTM's result. The output format is `src \t height \t dst`.
* `NODE_CHANGES_OUT` is the change log of every node in the original graph. The output format is `node_id \t (#current_parent_been_chosen / #iterations) \t #in-coming_edges(parent candidates)`.
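
To make the baseline in `ORIGINAL_GRAPH_OUT` concrete, here is a minimal sketch of the random parent selection described above. It illustrates the idea only and is not the code used by `hdtm_ana`; the function names and the `seed` parameter are assumptions.

```python
import random
from collections import defaultdict


def random_parent_tree(edges, seed=None):
    """For every node with incoming edges, keep one randomly chosen source as its parent."""
    rng = random.Random(seed)
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(src)
    # Nodes without incoming edges (such as the root) get no entry.
    return {dst: rng.choice(sources) for dst, sources in incoming.items()}


def tree_lines(parent_of):
    """Format the tree as `src \\t dst` lines, one kept edge per line."""
    return [f"{src}\t{dst}" for dst, src in sorted(parent_of.items())]


if __name__ == "__main__":
    # Node 6 has two parent candidates (4 and 5); only one edge survives.
    edges = [(4, 5), (4, 6), (5, 6)]
    for line in tree_lines(random_parent_tree(edges, seed=0)):
        print(line)
```

`HIERARCHY_OUT` has the same tree shape, with the parent chosen by HDTM instead of at random.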
## Data analysis
All scripts used in this project are under the `scripts` folder. You can use `data_preprocess.py` to reproduce the result.
Graph generation is done in `R`, and the code is under the `analysis` folder. The R package `rdsg` is available at http://github.com/bxshi/rdsg.