Hi there, I hope to try discriminator-based PPLM with different sizes of GPT-2. To do this, I believe we need to retrain the discriminator with a different embedding size using the `paper_code/gpt2tunediscrim.py` script. (Please correct me if I'm wrong here!) However, I am a little unclear on how the training text files should be formatted to be compatible with this code. It looks like each line in `toxic_train.txt` is processed with `eval(d)` to become a dictionary or JSON-like object with the keys `'text'` and `'label'`. Here is the excerpt of code I am looking at:
```python
with open("datasets/toxic/toxic_train.txt") as f:
    data = []
    for d in f:
        data.append(eval(d))

x = []
y = []
for d in data:
    try:
        # seq = tokenizer.encode("Apple's iOS 9 'App thinning' feature will give your phone's storage a boost")
        seq = tokenizer.encode(d["text"])
        device = 'cuda'
        if(len(seq)<100):
            seq = torch.tensor([50256] + seq, device=device, dtype=torch.long)
        else:
            continue
        x.append(seq)
        y.append(int(np.sum(d['label'])>0))
    except:
        pass
```
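From the parsing logic above, my reading is that each line of `toxic_train.txt` should be a Python dict literal (readable by `eval`) with a `'text'` string and a `'label'` list of binary flags, since `np.sum(d['label']) > 0` collapses the list into a single binary class. Here is a minimal sketch of how I imagine such a file could be generated, assuming the source is the Jigsaw Toxic Comment Classification `train.csv` with its six label columns (the file paths and column names here are my guesses, not something I found in this repo):

```python
import csv

# ASSUMPTION: train.csv is the Jigsaw Toxic Comment Classification file,
# with columns: id, comment_text, toxic, severe_toxic, obscene, threat,
# insult, identity_hate. This may not match what the authors actually used.
LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

with open("train.csv", newline="", encoding="utf-8") as f_in, \
        open("datasets/toxic/toxic_train.txt", "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        record = {
            "text": row["comment_text"],
            "label": [int(row[c]) for c in LABEL_COLS],
        }
        # repr() escapes quotes and newlines, producing a one-line Python
        # literal that eval(d) can read back, e.g.:
        # {'text': 'example comment', 'label': [0, 0, 1, 0, 0, 0]}
        f_out.write(repr(record) + "\n")
```

If that guess at the format is right, any dataset would work as long as each line `eval`s to a dict with those two keys, but I'd like to confirm.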
Is there any chance you can share your training text files (e.g. `datasets/toxic/toxic_train.txt`) or the script you used to create the text files from the original datasets? Thank you!