Update README.md
adithya8 authored May 25, 2021
1 parent 5341df8 commit 5f39b96
Showing 1 changed file: models/README.md (35 additions, 24 deletions).
Although we trained models for a few domains of corpora, we recommend those usin…
## Model formats
The models are available in two formats:

* `.csv` - A comma-separated value format storing the linear transformation matrix of size `K X 768`, where `K` is the low dimension.
* `.pickle` - A pickle of the [Scikit-learn Decomposition](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) modules.

## Model Usage

The model files contain the matrix used to reduce the dimensions of the user representations, and this README explains how to use these models.

There are two steps to applying the models: (1) preparing the input in the expected format, and (2) applying the learnt reduction (through the CSVs, the pickle files, or DLATK).

### Input format

All models assume you have an input matrix of `N_observations X 768`, where `N_observations` is the training set size. The goal is to produce an `N_observations X K` output, where `K` is the dimensionality of the lower-dimensional representation produced by the model.

*Aggregating to user-level.* In many situations, one has multiple documents/posts/messages per individual. In that case, the message-level embeddings are averaged to obtain a single 768-dimensional representation per user (see the sketch below).
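
A minimal sketch of this aggregation, assuming the message-level embeddings sit in a pandas DataFrame alongside a hypothetical `user_id` column:

```py
import numpy as np
import pandas as pd

# Hypothetical message-level embeddings: one 768-d vector per message.
message_emb = pd.DataFrame(np.random.randn(100, 768))
message_emb["user_id"] = np.random.choice(["u1", "u2", "u3"], size=100)

# Average the message embeddings within each user to get one 768-d vector per user.
user_emb = message_emb.groupby("user_id").mean().to_numpy()  # shape: (N_observations, 768)
```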

### Using CSVs through python

Here is an example of how to use the CSV:

```py
def transform(user_emb):
    # input:
    #   user_emb: numpy matrix of N_observations X 768 -- average RoBERTa layer 11 embedding per user.
    # output:
    #   transformed_user_emb: numpy matrix of N_observations X K -- low-dimensional user representation.
    import numpy as np

    scalar = np.loadtxt("scalar.csv", delimiter=",")
    # shape: (2, 768); 1st row -> mean; 2nd row -> std
    user_emb = (user_emb - scalar[0]) / scalar[1]

    model = np.loadtxt("model.csv", delimiter=",")
    # shape of model: (K, 768)
    transformed_user_emb = np.dot(user_emb, model.T)
    return transformed_user_emb
```
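
For instance, assuming `scalar.csv` and `model.csv` are in the working directory and `user_emb` holds the stacked user vectors, the function above might be called like this (the random matrix is only a placeholder):

```py
import numpy as np

user_emb = np.random.randn(500, 768)        # placeholder for real user-level embeddings
transformed_user_emb = transform(user_emb)  # reads scalar.csv and model.csv from the working directory
print(transformed_user_emb.shape)           # -> (500, K)
```
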
### Using pickle files through python

These pickle files contain a [Scikit-learn Decomposition](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) model. To apply the learnt reduction, unpickle the model and run its transform() method on the user embeddings.
Here is an example showing how:

```py
def transform(user_emb):
    # input:
    #   user_emb: numpy matrix of N_observations X 768 -- average RoBERTa layer 11 embedding per user.
    # output:
    #   transformed_user_emb: numpy matrix of N_observations X K -- low-dimensional user representation.
    import pickle

    with open("model.pickle", "rb") as f:
        data = pickle.load(f)  # a dict holding the scalers and the reduction models

    scalar = data['scalers']['noOutcome']
    model = data['clusterModels']['noOutcome']

    user_emb = scalar.transform(user_emb)
    transformed_user_emb = model.transform(user_emb)
    return transformed_user_emb
```
### Using pickle files through DLATK

The message data consists of the user id (user_id), the message id (message_id), the message field, and the outcome field(s). The user embeddings are generated by averaging the transformer representations of all the messages from a user.
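
As an illustration, a minimal message table with this layout (the literal values below are made up) might look like:

```py
import pandas as pd

# Hypothetical rows illustrating the expected message-level layout.
messages = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "message_id": [10, 11, 12],
    "message":    ["first post", "second post", "a post from another user"],
    "outcome":    [0, 0, 1],   # outcome field(s), if any
})
```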

If the user embeddings have been generated using [DLATK](https://github.com/dlatk/dlatk), the pickled model can be applied directly:
```sh
python dlatkInterface.py -d {database-name} -t {table-name} -g {group-name} -f {user-embeddings-table-name} \
--transform_to_feats {dimred-table-name} --load --pickle {path-to-pickle-file}
```
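
For instance, with hypothetical database, table, and file names substituted for the placeholders, the call might look like:

```sh
python dlatkInterface.py -d my_database -t messages -g user_id -f user_roberta_embeddings \
--transform_to_feats roberta_reduced_feats --load --pickle models/dimred_model.pickle
```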




## Model Description

The models made available are pre-trained to reduce the 768 dimensions of roberta-base embeddings, using 3 datasets from different domains: Facebook (D_20), CLPsych 2019 (D_19), and CLPsych 2018 (D_18).
The D_20 dataset contains Facebook posts from 55k users, D_19 contains Reddit posts from 496 users on r/SuicideWatch, and D_18 contains essays written by approximately 10k children. To learn more about these datasets, refer to Section 3 of our paper.
