README indents
adithya8 committed Apr 26, 2021
1 parent 9c83a9d commit 6e23ebb
Showing 2 changed files with 18 additions and 6 deletions.

README.md (16 additions, 5 deletions)

DLATK first requires the messages to be tokenized before generating the embeddings.
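
A minimal sketch of that tokenization step, assuming DLATK's `--add_tokenized` flag (the db and table_name here mirror the commands below):

    # Assumed prerequisite: creates a tokenized copy of the message table
    python3 dlatkInterface.py -d db -t table_name -c user_id --add_tokenized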

The embedding generation command:

    CUDA_VISIBLE_DEVICES=0 python3 dlatkInterface.py -d db -t table_name -c user_id --add_emb \
        --emb_model roberta-base --emb_layers 11 --emb_msg_aggregation mean --batch_size 30

table_name = {D_20, T_20}
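
For instance, to generate embeddings for the D_20 table, substitute it for table_name:

    # Same command as above, with the table name filled in
    CUDA_VISIBLE_DEVICES=0 python3 dlatkInterface.py -d db -t D_20 -c user_id --add_emb \
        --emb_model roberta-base --emb_layers 11 --emb_msg_aggregation mean --batch_size 30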


The dimension reduction in DLATK is done in two steps, as explained in the report. The first step involves learning the reduction on the domain data and storing the learnt model in a pickle file:

    python3 dlatkInterface.py -d db -t table_name -c user_id --group_freq_thresh 1000 \
        -f 'feat$roberta_ba_meL11con$table_name$user_id$16to16' --model {dimred_model} \
        --fit_reducer --k 128 --save_model --picklefile dimred_model_128.pickle

The number of dimensions to reduce to (the number of components) can be changed by altering the argument of `--k`.

The `dimred_model` here can be one of `pca`, `nmf`, `fa`, or `ae` (non-linear auto-encoders). For `ae`, prefix the command with the `CUDA_VISIBLE_DEVICES` environment variable to specify the GPU.
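
For example, combining the two options just described: fitting a 64-dimensional auto-encoder reduction on GPU 0 (the fit command above with only `--model`, `--k`, and the illustrative pickle file name changed):

    # Non-linear auto-encoder reducer on GPU 0, reducing to 64 components
    CUDA_VISIBLE_DEVICES=0 python3 dlatkInterface.py -d db -t table_name -c user_id --group_freq_thresh 1000 \
        -f 'feat$roberta_ba_meL11con$table_name$user_id$16to16' --model ae \
        --fit_reducer --k 64 --save_model --picklefile dimred_model_64.pickle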

The second step applies this learnt reduction model to the task data to generate the reduced representations:

    python3 dlatkInterface.py -d db -t table_name -c user_id --group_freq_thresh 1000 \
        -f 'feat$roberta_ba_meL11con$table_name$user_id$16to16' \
        --transform_to_feats {dimred_table_name} --load --picklefile dimred_model_128.pickle

The name of the table in which to store the dimension-reduced representations is given in place of dimred_table_name.
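
For instance, writing the reduced features to a table named pca_128 (a hypothetical name standing in for dimred_table_name):

    # Applies the saved reducer and stores the output in the pca_128 feature table
    python3 dlatkInterface.py -d db -t table_name -c user_id --group_freq_thresh 1000 \
        -f 'feat$roberta_ba_meL11con$table_name$user_id$16to16' \
        --transform_to_feats pca_128 --load --picklefile dimred_model_128.pickle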


The commands to perform bootstrapped training followed by evaluation for the regression tasks are given below:

    python3 dlatkInterface.py -d db -t task_table_name -c user_id --group_freq_thresh 1000 \
        -f '{feat_table_name}' --outcome_table 20_outcomes --outcomes age ext ope --train_reg \
        --model ridgehighcv --train_bootstraps 10 --where 'r10pct_test_fold is NOT NULL' \
        --train_bootstraps_ns 50 100 200 500 1000 --no_standardize \
        --save_models --picklefile reg_model_{feat_table_name}.pickle

The feat table name is either the raw embeddings table name or the dimension-reduced feature table name. The regression outcomes are listed in the `--outcomes` flag. The number of bootstrap repetitions is specified in the `--train_bootstraps` flag, and the sample sizes for bootstrapping are specified in the `--train_bootstraps_ns` flag. Finally, the ridge model is saved as a pickle file.

The saved ridge model is then loaded and applied for evaluation:

    python3 dlatkInterface.py -d db -t task_table_name -c user_id --group_freq_thresh 1000 \
        -f '{feat_table_name}' --outcome_table 20_outcomes --outcomes age ext ope --predict_reg \
        --where 'facet_fold = 1' --load --picklefile reg_model_{feat_table_name}.pickle > output.txt

This command stores the evaluation results for the ten runs in output.txt.

models/README.md (2 additions, 1 deletion)

Here is an example showing how:

If the user embeddings have been generated using [DLATK](https://github.com/DLATK/DLATK/) by following the commands [here](https://github.com/adithya8/ContextualEmbeddingDR#commands-to-extract-dimension-reduced-tables-using-a-specific-method), then you can use these pickle files directly by using the following command:

    python dlatkInterface.py -d {database-name} -t {table-name} -g {group-name} -f {user-embeddings-table-name} \
        --transform_to_feats {dimred-table-name} --load --picklefile {path-to-pickle-file}
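
A hypothetical invocation with the placeholders filled in (the database, table, and pickle file names here are illustrative only):

    # Illustrative names: database dla_tutorial, message table msgs, grouping by user_id
    python dlatkInterface.py -d dla_tutorial -t msgs -g user_id -f 'feat$roberta_ba_meL11con$msgs$user_id$16to16' \
        --transform_to_feats roberta_pca_64 --load --picklefile models/dimred_model_64.pickle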


### **Using CSVs through Python**