## Usage
# ./train-infer-mallet.sh NAME NUMTOPICS
# Where:
# NAME - the name of the work; inputs are read from tmp/train-NAME.txt and tmp/infer-NAME.txt
# NUMTOPICS - the number of topics to build the model for
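# Fail early with a usage message if either argument is missing.
NAME=${1:?"Usage: $0 NAME NUMTOPICS"}
NUM_TOPICS=${2:?"Usage: $0 NAME NUMTOPICS"}
#MALLET_HOME=mallet-2.0.7
: "${MALLET_HOME:?Need to set MALLET_HOME before running this script, e.g. export MALLET_HOME=/mallet-2.0.7}"
TRAIN="tmp/train-$NAME.txt"
INFER="tmp/infer-$NAME.txt"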
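### Optional input check (a sketch added for illustration; not part of the original script)
# check_inputs is a hypothetical helper. It assumes that NUMTOPICS should be a
# positive integer and that every input file passed to it must be readable.
# Usage: check_inputs NUMTOPICS FILE...
check_inputs() {
    case $1 in
        ''|*[!0-9]*) echo "NUMTOPICS must be a positive integer, got '$1'" >&2; return 1 ;;
    esac
    [ "$1" -gt 0 ] || { echo "NUMTOPICS must be greater than zero" >&2; return 1; }
    shift
    for f in "$@"; do
        [ -r "$f" ] || { echo "Cannot read input file: $f" >&2; return 1; }
    done
}
# To enable the check, uncomment the next line:
# check_inputs "$NUM_TOPICS" "$TRAIN" "$INFER" || exit 1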
### Index single-page docs (training) for Mallet
"$MALLET_HOME/bin/mallet" import-file --input "$TRAIN" \
    --output "tmp/singlepage-$NAME.mallet" \
    --remove-stopwords --keep-sequence TRUE
# --keep-sequence TRUE stores each document as an ordered word sequence.
# LDA itself ignores word order (it assumes conditional independence between features),
# but Mallet's train-topics requires sequence input, so the flag is needed for programmatic reasons.
### Index sliding-frame docs (inference) for Mallet
"$MALLET_HOME/bin/mallet" import-file --input "$INFER" \
    --output "tmp/slidingframe-$NAME.mallet" \
    --remove-stopwords --keep-sequence TRUE \
    --use-pipe-from "tmp/singlepage-$NAME.mallet"
# --use-pipe-from must point at the single-page .mallet file, so that the inference
# data shares its pipe and vocabulary and stays compatible with the inferencer.
### Train topics and save inferencer
"$MALLET_HOME/bin/mallet" train-topics --input "tmp/singlepage-$NAME.mallet" \
    --num-topics "$NUM_TOPICS" --output-topic-keys tmp/topic_keys.txt \
    --optimize-interval 20 --num-iterations 3500 \
    --inferencer-filename "tmp/book-$NAME.inferencer"
# Note: --num-iterations 3500 is quite high. The flag can be removed to fall back to the default of 1000.
# Watch the log-likelihood that Mallet prints during training: once it stops improving, extra iterations add little.
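### Optional: preview the trained topics (a sketch added for illustration; not part of the original script)
# show_topic_keys is a hypothetical helper. It assumes the usual topic-keys layout of
# three tab-separated columns: topic id, Dirichlet weight, space-separated top words.
# Usage: show_topic_keys FILE [WORDS_PER_TOPIC]   (WORDS_PER_TOPIC defaults to 5)
show_topic_keys() {
    awk -F '\t' -v n="${2:-5}" '{
        count = split($3, words, " ")
        line = $1 ":"
        for (i = 1; i <= n && i <= count; i++) line = line " " words[i]
        print line
    }' "$1"
}
# To print the first five words of each topic, uncomment the next line:
# show_topic_keys tmp/topic_keys.txt 5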
### Infer topics for sliding page frame
"$MALLET_HOME/bin/mallet" infer-topics \
    --inferencer "tmp/book-$NAME.inferencer" \
    --input "tmp/slidingframe-$NAME.mallet" \
    --output-doc-topics tmp/inferred-pageframe-topics.txt
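### Optional: confirm that inference produced output (a sketch added for illustration; not part of the original script)
# count_rows is a hypothetical helper that counts the lines of a file that do not start with '#'.
# It assumes the doc-topics file marks its header with '#' and that each remaining line
# describes one inferred document; the exact column layout may differ between Mallet versions.
# Usage: count_rows FILE
count_rows() {
    grep -vc '^#' "$1"
}
# To report how many documents were inferred, uncomment the next line:
# echo "Inferred documents: $(count_rows tmp/inferred-pageframe-topics.txt)"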