added older_scripts folder
adithya8 committed Apr 24, 2021
2 parents 5abd42b + ff2ca7c commit 4e9d262
Showing 23 changed files with 1,056 additions and 9 deletions.
21 changes: 12 additions & 9 deletions README.md
@@ -1,14 +1,6 @@
# **Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality**

## *Adithya V Ganesan, 112683104*

----

The FB dataset detailed in the report was used for this project. The data was loaded into MySQL so that all experiments could be run with [DLATK](https://github.com/dlatk/dlatk), a language toolkit. This README contains the command logs used to collect the experimental results. All commands except those for transformer embedding generation were executed on the "dev" branch; the embedding generation commands were executed after checking out the "dev-transformers" branch.

The messages for the domain data were stored in a table named D_20, and those for the task data in a table named T_20. The outcomes (age, gen, ext, ope) were stored in a table named 20_outcomes. The database is referred to as "db".

----
---

### **Commands to extract RoBERTa embeddings:**

@@ -57,3 +49,14 @@ This command would store the evaluation result for the ten runs in output.txt.
For classification tasks, the commands vary slightly: the value of the outcomes flag is changed to the appropriate categorical column name, and `--train_reg` and `--predict_reg` are replaced with `--train_classifiers` and `--predict_classifiers`, respectively.
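
As a rough illustration of the flag swap described above, the following sketch selects the train and predict flags by task type. The helper name `dlatk_task_flags` and the task-type labels are hypothetical and not part of DLATK; only the four flag names come from this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DLATK): print the train/predict flags
# named in this README for a given task type.
dlatk_task_flags() {
  case "$1" in
    regression)     printf '%s\n' "--train_reg --predict_reg" ;;
    classification) printf '%s\n' "--train_classifiers --predict_classifiers" ;;
    *)              printf 'unknown task type: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Example: show the flags for a classification task.
dlatk_task_flags classification
```

The rest of each command (database, tables, feature tables, model) is left out on purpose, since those parts depend on the experiment being run.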

----

| Number of training samples | Demographic Tasks | Personality Tasks | Mental Health Tasks |
| -------------------------- | :---------------: | :---------------: | :-----------------: |
| 50 | 16 | 16 | 16 |
| 100 | 128 | 16 | 22 |
| 200 | 512 | 32 | 45 |
| 500 | 768 | 64 | 64 |
| 1000 | 768 | 90 | 64 |

This work is intended to show researchers in Computational Social Science a simple way to improve the performance of transformer-based models. We find that training PCA on transformer representations using the domain data improves model performance overall, with evidence that it handles longer sequences better than other reduction methods.
The table above summarizes systematic experiments and recommends the number of dimensions required for a given number of training samples in each task domain to achieve the best performance.
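
The table can also be read as a simple lookup. The sketch below encodes its values in a shell function; the function name `recommended_dims` and the domain keys (`demographic`, `personality`, `mental_health`) are hypothetical labels chosen for illustration, and only the exact sample counts listed in the table are covered.

```shell
#!/usr/bin/env bash
# Hypothetical lookup (names are illustrative): recommended number of
# dimensions from the table above, keyed by "<training samples>:<task domain>".
recommended_dims() {
  case "$1:$2" in
    50:demographic)     echo 16 ;;
    50:personality)     echo 16 ;;
    50:mental_health)   echo 16 ;;
    100:demographic)    echo 128 ;;
    100:personality)    echo 16 ;;
    100:mental_health)  echo 22 ;;
    200:demographic)    echo 512 ;;
    200:personality)    echo 32 ;;
    200:mental_health)  echo 45 ;;
    500:demographic)    echo 768 ;;
    500:personality)    echo 64 ;;
    500:mental_health)  echo 64 ;;
    1000:demographic)   echo 768 ;;
    1000:personality)   echo 90 ;;
    1000:mental_health) echo 64 ;;
    *) echo "no recommendation for $1 samples / $2" >&2; return 1 ;;
  esac
}

# Example: 500 training samples on a personality task.
recommended_dims 500 personality   # prints 64
```

Sample counts between the listed rows are deliberately rejected rather than interpolated, because the table does not say how to handle them.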
129 changes: 129 additions & 0 deletions older_scripts/BERTBDimRedExp.sh
@@ -0,0 +1,129 @@
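#author: @adithya8
# Usage: bash BERTBDimRedExp.sh <experiment> <dimRedModel> <msgk> [titlek] [noEval]
#   experiment: 18 selects clp18_adi, 19 selects clp19_adi, anything else selects fb20_adi
#   titlek is read only for experiment 19; pass 1 as the last argument to print the commands without running them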

if [ "$1" -eq "18" ];
then
declare -A db=(["d"]="clp18_adi" ["t"]="tr_a11essays" ["c"]="clp18_id")
declare -A dbTables=(["tr_a11"]="tr_a11essays")
declare -A lexTables=(["tr_a11"]="tr_a11_bertb_")
folderName="clp18_adi"
dimRedModel=$2
msgk=$3
noEval="0"

# Default to 0 so a missing argument does not break the integer test.
if [ "${4:-0}" -eq "1" ];
then
noEval="1"
fi
elif [ "$1" -eq "19" ];
then
declare -A db=(["d"]="clp19_adi" ["t"]="task_A" ["c"]="user_id")
declare -A dbTables=(["A"]="task_A" ["C"]="task_Cfil" ["At"]="task_A_title" ["Ct"]="task_Cfil_title")
declare -A lexTables=(["A"]="task_A_bert_" ["C"]="task_Cfil_bert_" ["At"]="taskAt_bert_" ["Ct"]="taskCt_bert_")
folderName="clp19_adi"
dimRedModel=$2
msgk=$3
titlek=$4

noEval="0"

if [ "${5:-0}" -eq "1" ];
then
noEval="1"
fi
else
declare -A db=(["d"]="fb20_adi" ["t"]="tr_fb" ["c"]="user_id")
declare -A dbTables=(["tr_fb"]="tr_fb")
declare -A lexTables=(["tr_fb"]="tr_fb_bert_")
folderName="fb20_adi"
dimRedModel=$2
msgk=$3
noEval="0"

if [ "${4:-0}" -eq "1" ];
then
noEval="1"
fi

fi


declare -A dimRedModels=(["fa"]="fa" ["pca"]="pca" ["nmf"]="nmf")

echo "$1_$2_$3"
echo "${noEval}"

resultFile="BERTb_${dimRedModel}_${msgk}_${titlek}.txt"

echo "$resultFile"

if [ "${noEval}" -eq "0" ];
then
# "~" is not expanded inside double quotes, so build the path from $HOME.
resultDir="${HOME}/NLP/ContextualEmbeddingDR/results/${folderName}/BERTb_${dimRedModel}/"
if test -d "${resultDir}"; then
echo "Directory Exists"
else
# -p also creates any missing parent directories.
mkdir -p "${resultDir}"
if test -d "${resultDir}"; then
echo "Directory Created"
fi
fi
fi

for i in ${!dbTables[@]}
do
dbTable=${dbTables[$i]}
lexTable=${lexTables[$i]}

if [[ $dbTable =~ "title" ]]
then
k=${titlek}
else
k=${msgk}
fi

lexTableCr="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${dbTable} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$bert_ba_un_memimaL10co\$${dbTable}\$${db["c"]}\$16to16' --fit_reducer --model ${dimRedModel} --n_components ${k} --reducer_to_lexicon ${lexTable}${dimRedModel}${k}"
echo "${lexTableCr}"

if [ "${noEval}" -eq "0" ];
then
eval "${lexTableCr}"
fi

echo "----------------------------------------------------------------------"
weightLex="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${dbTable} -c ${db["c"]} --group_freq_thresh 0 --word_table 'feat\$bert_ba_un_memimaL10co\$${dbTable}\$${db["c"]}\$16to16' --add_lex -l ${lexTable}${dimRedModel}${k} --weighted_lex"
echo "${weightLex}"

if [ "${noEval}" -eq "0" ];
then
eval "${weightLex}"
fi

echo "----------------------------------------------------------------------"
done
if [ "$1" -eq "18" ];
then
finalCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["tr_a11"]}${dimRedModel}${msgk}_w\$${db["t"]}\$${db["c"]}\$bert' --outcome_table tr_variables --outcomes a11_bsag_total --combo_test_reg --model ridgehighcv --folds 10 > ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/BERTb_${dimRedModel}/${resultFile}"
elif [ "$1" -eq "19" ];
then
finalCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["A"]}${dimRedModel}${msgk}_w\$${db["t"]}\$${db["c"]}\$bert' 'feat\$cat_${lexTables["At"]}${dimRedModel}${titlek}_w\$task_A_title\$user_id\$bert' 'feat\$cat_${lexTables["C"]}${dimRedModel}${msgk}_w\$task_Cfil\$user_id\$bert' 'feat\$cat_${lexTables["Ct"]}${dimRedModel}${titlek}_w\$task_Cfil_title\$user_id\$bert' --outcome_table task_labels_full --outcomes label --nfold_classifiers --model lr --folds 10 > ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/BERTb_${dimRedModel}/${resultFile}"
else
finalCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["tr_fb"]}${dimRedModel}${msgk}_w\$${db["t"]}\$${db["c"]}\$bert' --outcome_table masterstats_friendratings --outcomes ope con ext agr neu --combo_test_reg --model ridgehighcv --folds 10 > ~/NLP/ContextualEmbeddingDR/results/${folderName}/BERTb_${dimRedModel}/${resultFile}"
fi
echo "${finalCommand}"
if [ "${noEval}" -eq "0" ];
then
eval "${finalCommand}"
fi
echo "----------------------------------------------------------------------"

if [ "${noEval}" -eq "0" ];
then
if [ "$1" -eq "18" ];
then
eval "cat ~/NLP/ContextualEmbeddingDR/results/${folderName}/BERTb_${dimRedModel}/${resultFile} | grep \'r\':"
elif [ "$1" -eq "19" ];
then
eval "cat ~/NLP/ContextualEmbeddingDR/results/${folderName}/BERTb_${dimRedModel}/${resultFile} | grep \'f1\':"
else
eval "cat ~/NLP/ContextualEmbeddingDR/results/${folderName}/BERTb_${dimRedModel}/${resultFile} | grep \'r\':"
fi
fi
99 changes: 99 additions & 0 deletions older_scripts/BERTBDimRedTest.sh
@@ -0,0 +1,99 @@
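# author: @adithya8
# Usage: bash BERTBDimRedTest.sh <experiment> <dimRedModel> <msgk> [titlek] [noEval]
#   experiment: 18 selects clp18_adi, anything else selects clp19_adi
#   titlek is read only when experiment is not 18; pass 1 as the last argument to print the commands without running them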

if [ "$1" -eq "18" ];
then
declare -A db=(["d"]="clp18_adi" ["t"]="tr_a11essays" ["c"]="clp18_id" ["o"]="a11_bsag_total")
declare -A dbTables=(["te_a11"]="te_a11essays")
declare -A lexTables=(["te_a11"]="tr_a11_bertb_")
declare -A alpha=(["16"]="100" ["32"]="100" ["64"]="100" ["128"]="1000" ["256"]="1000" ["512"]="10000" ["1024"]="10000" ["2048"]="10000")
dimRedModel=$2
msgk=$3
noEval="0"
# Default to 0 so a missing argument does not break the integer test.
if [ "${4:-0}" -eq "1" ];
then
noEval="1"
fi

saveModelCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["te_a11"]}${dimRedModel}${msgk}_w\$tr_a11essays\$clp18_id\$bert' --outcome_table tr_variables --outcomes a11_bsag_total --train_regression --model ridge${alpha[${msgk}]} --save_model --picklefile /data/avirinchipur/models/clp${1}_adi/BERTb_${dimRedModel}_${msgk}.pickle"
else
declare -A db=(["d"]="clp19_adi" ["t"]="task_A" ["c"]="user_id" ["o"]="label")
declare -A dbTables=(["Ate"]="task_A_test" ["Cte"]="task_Cfil_test" ["Atte"]="task_A_title_test" ["Ctte"]="task_Cfil_title_test")
declare -A lexTables=(["Ate"]="task_A_bert_" ["Cte"]="task_Cfil_bert_" ["Atte"]="taskAt_bert_" ["Ctte"]="taskCt_bert_")
dimRedModel=$2
msgk=$3
titlek=$4
noEval="0"

if [ "${5:-0}" -eq "1" ];
then
noEval="1"
fi
saveModelCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["Ate"]}${dimRedModel}${msgk}_w\$task_A\$user_id\$bert' 'feat\$cat_${lexTables["Atte"]}${dimRedModel}${titlek}_w\$task_A_title\$user_id\$bert' 'feat\$cat_${lexTables["Cte"]}${dimRedModel}${msgk}_w\$task_Cfil\$user_id\$bert' 'feat\$cat_${lexTables["Ctte"]}${dimRedModel}${titlek}_w\$task_Cfil_title\$user_id\$bert' --outcome_table task_labels_full --outcomes label --train_classifiers --model lr --save_model --picklefile /data/avirinchipur/models/clp${1}_adi/BERTb_${dimRedModel}_${msgk}_${titlek}.pickle"
fi

echo "${saveModelCommand}"

if [ "$noEval" -eq "0" ];
then
eval "${saveModelCommand}"
fi


declare -A dimRedModels=(["fa"]="fa" ["pca"]="pca" ["nmf"]="nmf")

resultFile="BERTb_${dimRedModel}_${msgk}_${titlek}_test.txt"

echo "$resultFile"

echo "$1_$2_$3"

if [ "${noEval}" -eq "0" ];
then
# "~" is not expanded inside double quotes, so build the path from $HOME.
resultDir="${HOME}/NLP/ContextualEmbeddingDR/results/clp${1}_adi/BERTb_${dimRedModel}/"
if test -d "${resultDir}"; then
echo "Directory Exists"
else
# -p also creates any missing parent directories.
mkdir -p "${resultDir}"
if test -d "${resultDir}"; then
echo "Directory Created"
fi
fi
fi

for i in ${!dbTables[@]}
do
dbTable=${dbTables[$i]}
lexTable=${lexTables[$i]}

if [[ $dbTable =~ "title" ]]
then
k=${titlek}
else
k=${msgk}
fi

weightLex="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${dbTable} -c ${db["c"]} --group_freq_thresh 0 --word_table 'feat\$bert_ba_un_memimaL10co\$${dbTable}\$${db["c"]}\$16to16' --add_lex -l ${lexTable}${dimRedModel}${k} --weighted_lex"
echo "${weightLex}"

if [ "${noEval}" -eq "0" ];
then
eval "${weightLex}"
fi

echo "----------------------------------------------------------------------"
done

if [ "$1" -eq "18" ];
then
finalCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["te_a11"]}${dimRedModel}${msgk}_w\$te_a11essays\$${db["c"]}\$bert' --outcome_table te_a11essays_labels --outcomes a11_bsag_total --predict_regression --model ridge${alpha[${msgk}]} --load --picklefile /data/avirinchipur/models/clp${1}_adi/BERTb_${dimRedModel}_${msgk}.pickle > ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/BERTb_${dimRedModel}/${resultFile}"
else
finalCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_task_A_bert_${dimRedModel}${msgk}_w\$task_A_test\$user_id\$bert' 'feat\$cat_taskAt_bert_${dimRedModel}${titlek}_w\$task_A_title_test\$user_id\$bert' 'feat\$cat_task_Cfil_bert_${dimRedModel}${msgk}_w\$task_Cfil_test\$user_id\$bert' 'feat\$cat_taskCt_bert_${dimRedModel}${titlek}_w\$task_Cfil_title_test\$user_id\$bert' --outcome_table crowd_test_A_label --outcomes label --predict_classifiers --model lr --load --picklefile /data/avirinchipur/models/clp${1}_adi/BERTb_${dimRedModel}_${msgk}_${titlek}.pickle > ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/BERTb_${dimRedModel}/${resultFile}"
fi
echo "${finalCommand}"

if [ "$noEval" -eq "0" ];
then
eval "${finalCommand}"
fi

echo "----------------------------------------------------------------------"

44 changes: 44 additions & 0 deletions older_scripts/ExpIter.sh
@@ -0,0 +1,44 @@
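#author: @adithya8
# Usage: bash ExpIter.sh [experiment] <contextualEmbedding> <dimRedModel>
#   experiment defaults to 19 when only two arguments are given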

declare -a msg=(14 36 64 118 207 386)
declare -a title=(6 14 26 47 83 154)
declare -a msgK=(14 36 71 143 286 357 495)
declare -a titleK=(6 14 29 57 114 143 198)
declare -a totK=(16 32 64 128 256 512 1024 2048)
declare -a totK_=( )


# Last array index shouldn't apply for XLNet

if [ "$#" -eq "3" ];
then
experiment=$1
contextualEmbedding=$2
dimRedModel=$3
elif [ "$#" -eq "2" ];
then
experiment="19"
contextualEmbedding=$1
dimRedModel=$2
else
echo "Pass Contextual Embedding (BERTB/XLNet), DimRedModel name as arg (pca/fa/nmf/nmfrand) -- Exiting !!!!"
exit 1
fi

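# Compare the resolved experiment, not $1: with two arguments $1 holds the embedding name.
if [ "${experiment}" -ne "19" ];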
then
msgK=( "${totK[@]}" )
titleK=( "${totK_[@]}" )
fi

arraylength=${#msgK[@]}
for (( i=1; i<${arraylength}+1; i++ ));
do
echo "bash ~/NLP/ContextualEmbeddingDR/${contextualEmbedding}DimRedExp.sh ${experiment} ${dimRedModel} ${msgK[$i-1]} ${titleK[$i-1]}"
echo "----------------------------------------------------------------------"
eval "bash ~/NLP/ContextualEmbeddingDR/${contextualEmbedding}DimRedExp.sh ${experiment} ${dimRedModel} ${msgK[$i-1]} ${titleK[$i-1]}"
done

#echo "f1 scores for increasing k sizes"
#eval "cat ./results/XLNet_${dimRedModel}/* | grep \'f1\':"
#eval "python ~/NLP/ContextualEmbeddingDR/tableMaker.py ${experiment} ${contextualEmbedding} ${dimRedModel}"
75 changes: 75 additions & 0 deletions older_scripts/XLNetDimRedExp.sh
@@ -0,0 +1,75 @@
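#author: @adithya8
# Usage: bash XLNetDimRedExp.sh <experiment> <dimRedModel> <msgk> [titlek]
#   experiment: 18 selects clp18_adi, anything else selects clp19_adi (titlek is read only in that case)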

if [ "$1" -eq "18" ];
then
declare -A db=(["d"]="clp18_adi" ["t"]="tr_a11essays" ["c"]="clp18_id")
declare -A dbTables=(["tr_a11"]="tr_a11essays")
declare -A lexTables=(["tr_a11"]="tr_a11_xln_")
dimRedModel=$2
msgk=$3
else
declare -A db=(["d"]="clp19_adi" ["t"]="task_A" ["c"]="user_id")
declare -A dbTables=(["A"]="task_A" ["C"]="task_Cfil" ["At"]="task_A_title" ["Ct"]="task_Cfil_title")
declare -A lexTables=(["A"]="task_A_xln_" ["C"]="task_Cfil_xln_" ["At"]="taskAt_xln_" ["Ct"]="taskCt_xln_")
dimRedModel=$2
msgk=$3
titlek=$4
fi

declare -A dimRedModels=(["fa"]="fa" ["pca"]="pca" ["nmf"]="nmf")

resultFile="XLN_${dimRedModel}_${msgk}_${titlek}.txt"

echo "$resultFile"

echo "$1_$2_$3"

# "~" is not expanded inside double quotes, so build the path from $HOME.
resultDir="${HOME}/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/"
if test -d "${resultDir}"; then
echo "Directory Exists"
else
# -p also creates any missing parent directories.
mkdir -p "${resultDir}"
if test -d "${resultDir}"; then
echo "Directory Created"
fi
fi

for i in ${!dbTables[@]}
do
dbTable=${dbTables[$i]}
lexTable=${lexTables[$i]}

if [[ $dbTable =~ "title" ]]
then
k=${titlek}
else
k=${msgk}
fi

lexTableCr="~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${dbTable} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$xlnet_ba_ca_memamiL10co\$${dbTable}\$${db["c"]}\$16to16' --fit_reducer --model ${dimRedModel} --n_components ${k} --reducer_to_lexicon ${lexTable}${dimRedModel}${k}"
echo "${lexTableCr}"
eval "${lexTableCr}"
echo "----------------------------------------------------------------------"
weightLex="~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${dbTable} -c ${db["c"]} --group_freq_thresh 0 --word_table 'feat\$xlnet_ba_ca_memamiL10co\$${dbTable}\$${db["c"]}\$16to16' --add_lex -l ${lexTable}${dimRedModel}${k} --weighted_lex"
echo "${weightLex}"
eval "${weightLex}"
echo "----------------------------------------------------------------------"
done

if [ "$1" -eq "18" ];
then
finalCommand="~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["tr_a11"]}${dimRedModel}${msgk}_w\$${db["t"]}\$${db["c"]}\$xlne' --outcome_table tr_variables --outcomes a11_bsag_total --combo_test_reg --model ridgehighcv --folds 10 > ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/${resultFile}"
#saveModelCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["tr_a11"]}${dimRedModel}${msgk}_w\$${db["t"]}\$${db["c"]}\$xlne' --outcome_table task_labels_full --outcomes label --train_classifiers --model ridge --save_model --picklefile ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/xln_${dimRedModel}_${msgk}_${titlek}.pickle"
else
finalCommand="~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_${lexTables["A"]}${dimRedModel}${msgk}_w\$${dbTables["A"]}\$${db["c"]}\$xlne' 'feat\$cat_${lexTables["At"]}${dimRedModel}${titlek}_w\$${dbTables["At"]}\$${db["c"]}\$xlne' 'feat\$cat_${lexTables["Ct"]}${dimRedModel}${titlek}_w\$${dbTables["Ct"]}\$${db["c"]}\$xlne' 'feat\$cat_${lexTables["C"]}${dimRedModel}${msgk}_w\$${dbTables["C"]}\$${db["c"]}\$xlne' --outcome_table task_labels_full --outcomes label --nfold_classifiers --model lr --folds 10 > ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/${resultFile}"
#saveModelCommand="python3 ~/dlatk/dlatk/dlatkInterface.py -d ${db["d"]} -t ${db["t"]} -c ${db["c"]} --group_freq_thresh 0 -f 'feat\$cat_task_A_xln_${dimRedModel}${msgk}_w\$task_A\$user_id\$xlne' 'feat\$cat_taskAt_xln_${dimRedModel}${titlek}_w\$task_A_title\$user_id\$xlne' 'feat\$cat_taskCt_xln_${dimRedModel}${titlek}_w\$task_Cfil_title\$user_id\$xlne' 'feat\$cat_task_Cfil_xln_${dimRedModel}${msgk}_w\$task_Cfil\$user_id\$xlne' --outcome_table task_labels_full --outcomes label --train_classifiers --model lr --save_model --picklefile ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/xln_${dimRedModel}_${msgk}_${titlek}.pickle"
fi
echo "${finalCommand}"
eval "${finalCommand}"
echo "----------------------------------------------------------------------"

if [ "$1" -eq "18" ];
then
eval "cat ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/${resultFile} | grep \'R\':"
else
eval "cat ~/NLP/ContextualEmbeddingDR/results/clp${1}_adi/XLNet_${dimRedModel}/${resultFile} | grep \'f1\':"
fi