Science: paper
Pipeline:
- Collect GitHub repositories you wish to process, e.g. from the Public Git Archive.
- Produce a BOW (bag-of-words) model from the identifiers inside those repositories.
- Convert the BOW model to the Vowpal Wabbit format.
- Convert the Vowpal Wabbit dataset to BigARTM batches.
- Train the topic model using BigARTM.
- Convert the result to a Topics model in ASDF format.
Ensure that the Babelfish server is running.
```
srcml repos2bow -f id -x repo -l Java Python Ruby --min-docfreq 5 --persist DISK_ONLY --docfreq docfreq.asdf --bow bow.asdf
```
Change "Java Python Ruby" to any list of languages you want to process and are annotated by Babelfish. It is possible to run it on a Spark cluster, in that case specify --spark
.
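For intuition, here is a minimal sketch of what the bag-of-words extraction does conceptually: split each identifier on snake_case and camelCase boundaries and count the resulting tokens per repository. Everything below (the tokenizer and the sample identifiers) is a hypothetical simplification; the real command extracts identifiers from Babelfish UASTs.

```python
import re
from collections import Counter

def split_identifier(name):
    """Split an identifier on snake_case and camelCase boundaries (simplified)."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # break camelCase: "readFileSync" -> ["read", "File", "Sync"]
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens if t]

# hypothetical identifiers extracted from a single repository
identifiers = ["readFileSync", "file_reader", "HTTPServer", "read_buffer"]

bow = Counter()
for identifier in identifiers:
    bow.update(split_identifier(identifier))

print(bow)  # Counter({'read': 2, 'file': 2, 'reader': 1, ...})
```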
```
srcml bow2vw --bow bow.asdf -o vw_dataset.txt
```
We transform the merged BOW model, which is stored in the ASDF binary format, to the simple text Vowpal Wabbit format. We use this intermediate format because BigARTM's Python API is much slower at the direct conversion.
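The Vowpal Wabbit format that BigARTM consumes is one document per line: the document name followed by `token:count` pairs. A minimal sketch of such a conversion, with a hypothetical in-memory `bows` mapping standing in for the real ASDF model:

```python
# hypothetical in-memory BOW: document name -> {token: count}
bows = {
    "repo1": {"read": 3, "file": 2, "server": 1},
    "repo2": {"parse": 4, "token": 2},
}

with open("vw_dataset.txt", "w") as fout:
    for doc, counts in bows.items():
        pairs = " ".join("%s:%d" % (token, count) for token, count in counts.items())
        fout.write("%s %s\n" % (doc, pairs))

# each resulting line looks like: "repo1 read:3 file:2 server:1"
```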
You will need a working `bigartm` command-line application. The following command should install `bigartm` to the current working directory, provided that you have all the dependencies present in the system.
```
srcml bigartm
```
The actual conversion of the dataset to BigARTM batches (`-p 0` requests zero training passes, so this run only writes the batches and the dictionary):

```
./bigartm -c vw_dataset.txt -p 0 --save-batches artm_batches --save-dictionary artm_batches/artm.dict
```
Stage 1 performs the main optimization:

```
./bigartm --use-batches artm_batches --use-dictionary artm_batches/artm.dict -t 256 -p 20 --threads 4 --rand-seed 777 --regularizer "1000 Decorrelation" --save-model stage1.bigartm
```
Stage 2 optimizes for sparsity:

```
./bigartm --use-batches artm_batches --use-dictionary artm_batches/artm.dict --load-model stage1.bigartm -p 10 --threads 4 --rand-seed 777 --regularizer "1000 Decorrelation" "0.5 SparsePhi" "0.5 SparseTheta" --save-model stage2.bigartm
```
We set the number of topics to 256 and the number of worker threads to 4. `-p` sets the number of iterations (passes) over the dataset. Choosing the stages and the regularizers is an art; please refer to the BigARTM papers.
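For reference, here is a rough sketch of the same two-stage training through BigARTM's Python API. It assumes the CLI-produced batches and dictionary load cleanly from Python, and note one sign-convention assumption: in the Python API, sparsity is induced by a negative `tau` on the smooth/sparse regularizers, while the CLI takes positive coefficients.

```python
import artm

# batches and dictionary produced by the conversion step above
batch_vectorizer = artm.BatchVectorizer(data_path="artm_batches",
                                        data_format="batches")
dictionary = artm.Dictionary()
dictionary.load("artm_batches/artm.dict")

model = artm.ARTM(num_topics=256, dictionary=dictionary,
                  num_processors=4, seed=777)

# stage 1: main optimization with topic decorrelation
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name="decorr", tau=1000))
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=20)

# stage 2: add sparsity (negative tau sparsifies, per the assumption above)
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name="sparse_phi", tau=-0.5))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-0.5))
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
```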
First, we convert the model to the text format:

```
./bigartm --use-batches artm_batches --use-dictionary artm_batches/artm.dict --load-model stage2.bigartm -p 0 --write-model-readable readable_stage2.txt
```
Second, we convert the text format to ASDF:

```
srcml bigartm2asdf readable_stage2.txt topic_model.asdf
```
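The resulting ASDF file can then be loaded back from Python. A minimal sketch, assuming the `Topics` model class is importable from `sourced.ml.models`; check your installed version for the exact import path:

```python
from sourced.ml.models import Topics  # import path is an assumption

# modelforge-style models load in place and return self
topics = Topics().load("topic_model.asdf")
print(topics)  # prints a summary of the loaded topic model
```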