diff --git a/models/README.md b/models/README.md
index 1e48f7f..d83ff53 100644
--- a/models/README.md
+++ b/models/README.md
@@ -14,10 +14,10 @@ Although we trained models for a few domains of corpora, we recommend those usin
 ## Model formats
 
 The models are available in two formats:
- * `.csv` - A comma-seprated value format storing the linear transformation matrix of size `p X 768` where `p` is the low dimension.
+ * `.csv` - A comma-separated value format storing the linear transformation matrix of size `K X 768`, where `K` is the reduced dimensionality.
  * `.pickle` - A pickle of the [Scikit-learn Decomposition](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) modules.
 
-## **Model Usage**
+## Model Usage
 
 The model files contain the matrix to reduce the dimensions of these user representations and this README explains how to use these models.
 
@@ -28,36 +28,50 @@ There are two steps to applying the models:
 
 ### Input format
 
-All models assume you have an input matrix of `768 X N_observations`, where `N_observations` is the training set size. The goal is produce `p X N_obsevations` output where `p` is the lower dimensional represetnation from the model.
+All models assume you have an input matrix of `N_observations X 768`, where `N_observations` is the number of observations (typically users). The goal is to produce an `N_observations X K` output, where `K` is the dimensionality of the lower-dimensional representation learned by the model.
 
-*Aggregating to user-level.* In many situations, one has multiple documents/posts/messages per individual.
+*Aggregating to user-level.* In many situations, one has multiple documents/posts/messages per individual. In that case, embed each message and average the message embeddings per user, so that each row of the input matrix corresponds to one user (see the sketch below).
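+
+One simple way to build this per-user matrix is to average the per-message embeddings of each user. The helper below is only an illustrative sketch (the function name, its arguments, and the use of pandas are not part of this repository); it assumes the per-message embeddings are in an `N_messages X 768` numpy matrix aligned with a list of user ids:
+
+```py
+import numpy as np
+import pandas as pd
+
+def aggregate_to_user_level(user_ids, message_embs):
+    #input (hypothetical example names):
+    #   user_ids: array of length N_messages with the user id of each message.
+    #   message_embs: numpy matrix of N_messages X 768 -- per-message RoBERTa layer 11 embeddings.
+    #output:
+    #   users: array of the N_users unique user ids, in the row order of user_emb.
+    #   user_emb: numpy matrix of N_users X 768 -- per-user averages, ready to be transformed.
+    df = pd.DataFrame(np.asarray(message_embs))
+    df["user_id"] = list(user_ids)
+    user_means = df.groupby("user_id").mean()
+    return user_means.index.to_numpy(), user_means.to_numpy()
+```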
 
-### **Using CSVs through python**
+### Using CSVs through python
 
-If you are using the CSVs, here is an example for how to use it:
+Here is an example of how to use the CSV files:
+
+```py
+def transform(user_emb):
+    #input:
+    #   user_emb: numpy matrix of N_observations X 768 -- matrix of average RoBERTa layer 11 embeddings per user.
+    #output:
+    #   transformed_user_emb: numpy matrix of N_observations X K -- low-dimensional user representation.
     import numpy as np
+    scaler = np.loadtxt("scalar.csv", delimiter=",")
+    #shape of scaler: (2, 768); 1st row -> mean, 2nd row -> std
+    user_emb = (user_emb - scaler[0]) / scaler[1]
     model = np.loadtxt("model.csv", delimiter=",")
-    #shape of model: (x, 768)
+    #shape of model: (K, 768)
     transformed_user_emb = np.dot(user_emb, model.T)
-
-### **Using pickle files through python**
+    return transformed_user_emb
+```
+
+### Using pickle files through python
 
 These pickle files are composed of a [Scikit-learn Decomposition](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition). To apply the learnt reduction, you can unpickle the model and run transform() method on the user embeddings. Here is an example showing how:
 
-    def transform(user_emb):
-        #input:
-        #   user_emb: numpy matrix of N_observations X 768 -- matrix of average RoBERTA layer 11 per user.
-        #output:
-        #   transformed_user_emb: numpy matrix of N_observations X P -- low dimensional user representation.
-        import pickle
-        with open("model.pickle", "rb") as f:
-            model = pickle.load(f)["clusterModels"]["noOutcome"]
-        transformed_user_emb = model.transform(user_emb)
-        return transformed_user_emb
-
-### **Using pickle files through DLATK**
+```py
+def transform(user_emb):
+    #input:
+    #   user_emb: numpy matrix of N_observations X 768 -- matrix of average RoBERTa layer 11 embeddings per user.
+    #output:
+    #   transformed_user_emb: numpy matrix of N_observations X K -- low-dimensional user representation.
+    import pickle
+    with open("model.pickle", "rb") as f:
+        model_dict = pickle.load(f)  #the pickle holds a single dict; load it once and index both entries
+    scaler = model_dict["scalers"]["noOutcome"]
+    model = model_dict["clusterModels"]["noOutcome"]
+    user_emb = scaler.transform(user_emb)
+    transformed_user_emb = model.transform(user_emb)
+    return transformed_user_emb
+```
+
+### Using pickle files through DLATK
 
 The message data is composed of the user id (user_id), message id (message_id), the message field and the outcome field(s). The user embeddings are generated by averaging the transformer representation of all the messages from a user.
 
@@ -66,10 +80,7 @@ If the user embeddings have been generated using [DLATK](https://github.com/DLAT
 
     python dlatkInterface.py -d {database-name} -t {table-name} -g {group-name} -f {user-embeddings-table-name} \
     --transform_to_feats {dimred-table-name} --load --pickle {path-to-pickle-file}
-
-
-
-## **Model Description**
+## Model Description
 
 The models made available are pre-trained to reduce 768 dimensions of roberta-base using 3 datasets from different domains: Facebook (D_20), CLPsych 2019 (D_19), and CLPsych 2018 (D_18). D_20 dataset contains facebook posts of 55k users, while the D_19 has reddit posts from 496 users on r/SuicideWatch and D_18 contains essays written by approx 10k children. To know more about these datasets, refer to Section 3 in our paper.