Skip to content
/ HDTM Public

Hierarchical Document Topic Model

Notifications You must be signed in to change notification settings

nddsg/HDTM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hierarchical Document Topic Model

Install

  1. Clone this repo

     git clone https://github.com/nddsg/HDTM.git
    
  2. Configure GraphLab project

     cd HDTM/src/graphlab/
     ./configure
    
  3. Compile HDTM

     cd ./release/apps
     make -j2
    

Execution

HDTM arguments

	--vertices	The file contains article vertices
	--edges	The file contains article edges
	--root	Id of root article
	--gamma	Gamma for RWR calculation, default 0.25
	--alpha	Alpha, default 10
	--eta	Eta, default 0.1
	--iter	Iter, default 5
	--burn	Burn-in, default 100
	--sample	Sample interval, default 10
	--token	Number of tokens, run wc -l on dictionary file
	--prefix	Prefix of binary graph files
	--load	Load graph from binary files

Example:

	./hdtm --vertices /data/bshi/wikipedia/vertices.txt --edges /data/bshi/wikipedia/edges.txt \
	       --root 26700 --token 6064216 --iter 400 --burn 100 --sample 50 \
				 --prefix /data/bshi/wikipedia/graphlab/result/

HDTM requires:

  • a vertices file with a vertex id, a tab and a list of words id space separated

    4	3 4 5 6 7
    5	3 4 5 7 8
    6	8 10 12 14
    ....
    
  • edges file with vertex id, tab and vertex id

4	5
4	6
  ```

Actually, the algorithm requires that the words of the articles should be converted as int and that
the values 0 and 1 have not to be used since are utilized for `empty_key` and `deleted_key`.


When `hdtm` is finished, you will get a set of results that looks like this

  	bshi@dsg1:/data/bshi/wikipedia/graphlab/result$ ls
  	0_0_0.bin    140_0_0.bin  190_0_0.bin  240_0_0.bin  290_0_0.bin  340_0_0.bin  390_-1.89962e+14_0.bin
  	100_0_0.bin  150_0_0.bin  200_0_0.bin  250_0_0.bin  300_0_0.bin  350_0_0.bin
  	110_0_0.bin  160_0_0.bin  210_0_0.bin  260_0_0.bin  310_0_0.bin  360_0_0.bin
  	120_0_0.bin  170_0_0.bin  220_0_0.bin  270_0_0.bin  320_0_0.bin  370_0_0.bin
  	130_0_0.bin  180_0_0.bin  230_0_0.bin  280_0_0.bin  330_0_0.bin  380_0_0.bin

The first number in file names is the iteration number.

### Convert `HDTM` binary graph to text file

  	./hdtm_ana BINARY_FILE.bin ORIGINAL_GRAPH_OUT HIERARCHY_OUT NODE_CHANGES_OUT

* `ORIGINAL_GRAPH_OUT` is a baseline tree generated from input graph. Namely if in the original input graph a node has more than one in-coming edges, it will randomly pick one and discards the rest to construct a tree. The output file format is `src \t dst`.
* `HIERARCHY_OUT` is the result hierarchy. This is also a tree but the parent are selected based on HDTM's result. The output format is `src \t height \t dst`.
* `NODE_CHANGES_OUT` is the change log of every node in original graph. The output format is `node_id \t (#current_parent_been_chosen / #iterations) \t #in-coming_edges(parent candidates)`

## Data analysis

All scripts used in this project is under `scripts` folder. You can use `data_preprocess.py` to reproduce the result.

Graph generation is done in `R`, and you can find the code under `analysis` folder. You can find R package `rdsg` at http://github.com/bxshi/rdsg.