This repository is the official implementation of the ICML 2021 paper: Selfish Sparse RNN Training.
Our code is based heavily on the sparse training library of Sparse Momentum.
The library requires PyTorch v1.0.1 and CUDA v9.0.
You can install it via Anaconda or pip; see PyTorch/get-started for further information.
We provide the training code for Selfish stacked-LSTM, Selfish RHN, and Selfish ONLSTM.
To train the Selfish stacked-LSTM model from the paper on the PTB dataset with a GPU, run this command:

```
python main.py --sparse --sparse_init uniform --optimizer sgd --model LSTM --cuda --growth random --death magnitude --redistribution none --nonmono 5 --batch_size 20 --bptt 35 --lr 40 --clip 0.25 --seed 1111 --emsize 1500 --nhid 1500 --nlayers 2 --death-rate 0.8 --dropout 0.65 --density 0.33 --epochs 100
```
To train the Selfish RHN model from the paper on the PTB dataset with a GPU, run this command:

```
python main.py --sparse --sparse_init uniform --optimizer sgd --model RHN --cuda --tied --couple --seed 42 --nlayers 1 --growth random --death magnitude --redistribution none --density 0.472 --death-rate 0.5 --clip 0.25 --lr 15 --epochs 500 --dropout 0.65 --dropouth 0.25 --dropouti 0.65 --dropoute 0.2 --emsize 830 --nhid 830
```
To train the Selfish ONLSTM model from the paper on the PTB dataset with a GPU, run these two commands:

```
cd ONLSTM
python main_ONLSTM.py --sparse --sparse_init uniform --optimizer sgd --growth random --death magnitude --redistribution none --density 0.45 --death-rate 0.5 --batch_size 20 --dropout 0.45 --dropouth 0.3 --dropouti 0.5 --nonmono 5 --wdrop 0.45 --chunk_size 10 --seed 141 --epoch 1000
```
Options:
- --sparse - Enable sparse mode (remove this flag to train a dense model)
- --sparse_init (str) - Type of sparse initialization. Choose from: uniform, ER (default uniform); a sketch of how the two schemes split the density budget follows this list
- --evaluate (str) - Path to a pretrained model (default: none)
- --model (str) - Type of recurrent net. Choose from: RHN, LSTM (default LSTM)
- --optimizer (str) - Type of optimizer. Choose from: sgd (Sparse NT-ASGD), adam (default sgd)
- --growth (str) - Growth mode. Choose from: random, random_rnn, gradient (default random)
- --death (str) - Removing mode. Choose from: magnitude, magnitude_rnn, SET, threshold (default magnitude)
- --redistribution (str) - Redistribution mode. Choose from: magnitude, nonzeros, none (default none)
- --density (float) - Density level (default 0.33)
- --death-rate (float) - Initial pruning rate (default 0.5)
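For intuition about the two sparse initializations, the sketch below shows one way a global density budget can be split across layers: uniform gives every layer the same density, while Erdős–Rényi (ER) style initialization gives smaller layers proportionally higher density. This is an illustration only; the helper name layer_densities and its exact scaling are our assumptions, not the repository's Masking internals.

```python
# Illustrative sketch (not the repository's implementation) of how a global
# density budget could be split across layers for the two --sparse_init modes.
def layer_densities(shapes, density, mode="uniform"):
    """shapes maps layer name -> (n_out, n_in); density is the global fraction of nonzeros."""
    if mode == "uniform":
        # Uniform: every layer keeps the same fraction of its weights.
        return {name: density for name in shapes}
    # Erdos-Renyi-style: score each layer by (n_out + n_in) / (n_out * n_in),
    # then rescale so the total number of nonzeros matches the global budget.
    raw = {name: (n_out + n_in) / (n_out * n_in) for name, (n_out, n_in) in shapes.items()}
    budget = density * sum(n_out * n_in for n_out, n_in in shapes.values())
    scale = budget / sum(raw[name] * n_out * n_in for name, (n_out, n_in) in shapes.items())
    # Clip at 1.0; a full implementation would redistribute any surplus.
    return {name: min(1.0, scale * raw[name]) for name in raw}

# Example: the square recurrent layer ends up denser than the large decoder layer.
print(layer_densities({"rnn": (1500, 1500), "decoder": (10000, 1500)}, 0.33, mode="ER"))
```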
To evaluate the pre-trained Selfish stacked-LSTM model on PTB, run:
```
python main.py --sparse --evaluate model_path --optimizer sgd --model LSTM --cuda --growth random --death magnitude --redistribution none --nonmono 5 --batch_size 20 --bptt 35 --lr 40 --clip 0.25 --seed 5 --emsize 1500 --nhid 1500 --nlayers 2 --death-rate 0.7 --dropout 0.65 --density 0.33 --epochs 100
```
To evaluate a pre-trained model, replace model_path with the path to your saved model and keep all training hyper-parameters the same as in the training command.
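For reference, evaluation only needs a checkpoint that PyTorch can load back. Below is a minimal loading sketch, under the assumption (ours, not the repository's documented format) that the checkpoint pickles the whole model object, as AWD-LSTM-style codebases typically do; the --evaluate flag handles this for you.

```python
import torch

model_path = 'model.pt'  # hypothetical path; use wherever you saved the checkpoint

# Assumes the checkpoint stores the entire model object rather than a state_dict.
with open(model_path, 'rb') as f:
    model = torch.load(f, map_location='cpu')
model.eval()  # disable dropout for deterministic evaluation
```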
You can download the pretrained Selfish stacked-LSTM models here:
- Selfish stacked-LSTM trained on PTB.
This model achieves 71.65 test perplexity on the PTB dataset at a sparsity of 0.67. To evaluate it, run the evaluation command above, replacing model_path with the path where you saved the model.
Our models achieve the following performance on the Penn Treebank dataset:
| Model name | Sparsity | Validation perplexity | Test perplexity |
|---|---|---|---|
| Selfish stacked-LSTM | 0.67 | 73.79 | 71.65 |
| Selfish RHN | 0.53 | 62.10 | 60.35 |
| Selfish ONLSTM_1000 | 0.55 | 58.17 ± 0.06 | 56.31 ± 0.10 |
| Selfish ONLSTM_1300 | 0.55 | 57.67 ± 0.03 | 55.82 ± 0.11 |
| Model name | Sparsity | Validation perplexity | Test perplexity |
|---|---|---|---|
| Selfish AWD-LSTM-MoS (without finetuning) | 0.55 | 65.96 | 63.05 |
Applying Selfish-RNN to train other models is simple; you just need three steps:
(1) Create the masks:

```python
decay = CosineDecay(args.death_rate, args.epochs * len(train_data) // args.bptt)
mask = Masking(optimizer, death_rate=args.death_rate, death_mode=args.death,
               death_rate_decay=decay, growth_mode=args.growth,
               redistribution_mode=args.redistribution, model=args.model)
mask.add_module(model, sparse_init=args.sparse_init, density=args.density)
```
(2) Replace optimizer.step() with mask.step() in the training function, as in the sketch below.
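Put together, a minimal training-loop sketch might look like the following. Here get_batches, criterion, and args are placeholder names (our assumption, not the repository's exact API), and hidden-state handling is omitted; the key point is that mask.step() takes the place of optimizer.step(), performing the optimizer update and then re-applying the sparsity masks.

```python
import torch

# Hypothetical loop; get_batches, criterion, and args are placeholders.
for data, targets in get_batches(train_data, args.batch_size, args.bptt):
    optimizer.zero_grad()
    output = model(data)               # forward pass (hidden state omitted)
    loss = criterion(output, targets)  # e.g. cross-entropy over tokens
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
    mask.step()  # instead of optimizer.step(): updates weights, then re-applies masks
```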