Enabled multi-gpu training, buffers, grad clip in vocoder, saving optimizer state, and more fixes #109
Conversation
Let me know if you have any questions. I am less familiar with DataParallel, but it seems to be working on my multi-GPU setup. Note that I did not enable multi-GPU for the …
Also, let me know if the whole …
@fatchord let me know if there is anything you need to help get this merged!
@TheButlah Apologies for the delay! I hope to get around to testing this at the weekend. And thanks btw - this looks good.
Sorry, one thing - would it be possible to set the GPU ids in hparams.py, or perhaps through argparse using that workaround?
@fatchord I considered that, but I think the best way to do that is using the …
+ Added `voc_clip_grad_norm` hparam
+ Enabled gradient clipping in vocoder
+ Added warning when gradient is NaN in vocoder and synthesizer
+ Added workaround for broken nn.DataParallel in `utils.__init__.py`
* Refactored `tacotron.py` and `fatchord_version.py` to store step in a PyTorch buffer instead of a gradient-less parameter
* Refactored training code for WaveRNN and Tacotron to automagically use data parallelism in the presence of multiple GPUs
! Note that your batch size must be divisible by the number of GPUs
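For readers less familiar with these PyTorch pieces, here is a minimal, hypothetical sketch of the ideas above. The stub class and the helper `data_parallel_if_possible` are illustrative names, not the PR's actual code; only `voc_clip_grad_norm` comes from the change list.

```python
import torch
import torch.nn as nn

class WaveRNNStub(nn.Module):
    """Toy stand-in showing the buffer-based step counter."""
    def __init__(self):
        super().__init__()
        # A buffer is saved in the state dict and moved by .to(device),
        # but is not returned by model.parameters(), so the optimizer
        # never touches it (unlike a gradient-less nn.Parameter).
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

    def get_step(self):
        return int(self.step.item())

def data_parallel_if_possible(model):
    # Wrap in nn.DataParallel only when more than one GPU is visible;
    # otherwise return the model unchanged.
    if torch.cuda.device_count() > 1:
        return nn.DataParallel(model)
    return model

# Inside the vocoder training loop, gradient clipping with the new hparam
# might look roughly like this (clip_grad_norm_ returns the total norm,
# which can be checked for NaN to emit the warning):
#   loss.backward()
#   grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hp.voc_clip_grad_norm)
#   if torch.isnan(torch.as_tensor(grad_norm)):
#       print('Warning: grad_norm was NaN!')
#   optimizer.step()
```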
Update: I fixed a bug in my patch where I forgot to import one of the functions I made. More importantly though, I made the code save the optimizer state along with the latest weights in a separate file. This will ensure that when resuming training, the optimizer's momentum and other stateful properties are restored, ensuring that you can start and stop the model without affecting the properties of the training process. See #87 (comment)
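As a rough illustration of the optimizer-state change (the filenames here are assumptions, not necessarily the ones used in the PR):

```python
import torch

def save_checkpoint(model, optimizer,
                    model_path='latest_weights.pyt',
                    optim_path='latest_optim.pyt'):
    # Weights and optimizer state are written to separate files.
    torch.save(model.state_dict(), model_path)
    torch.save(optimizer.state_dict(), optim_path)

def restore_checkpoint(model, optimizer,
                       model_path='latest_weights.pyt',
                       optim_path='latest_optim.pyt'):
    model.load_state_dict(torch.load(model_path))
    # Restoring the optimizer brings back momentum / Adam moments, so a
    # stop-and-resume run behaves like an uninterrupted one.
    optimizer.load_state_dict(torch.load(optim_path))
```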
(force-pushed from 6422d25 to 45c6240)
(force-pushed from 2f76935 to bb4c66e)
* Made model.load() always map devices to the device of the model's parameters instead of always to CPU.
* Fixed a bug where checking whether a tensor was using CUDA was done by comparing it to the cuda device. This is not the correct way of doing things; not sure how it even worked before.
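A small sketch of what those two fixes amount to in PyTorch terms (the function names here are illustrative, not the fork's actual code):

```python
import torch

def load_weights(model, path):
    # Map the checkpoint onto whatever device the model already lives on,
    # rather than always onto the CPU.
    device = next(model.parameters()).device
    model.load_state_dict(torch.load(path, map_location=device))

def on_gpu(t):
    # Check device placement via .is_cuda (or t.device.type == 'cuda'),
    # not by comparing the tensor to a torch.device object.
    return t.is_cuda
```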
OK @fatchord, this should be fixed and ready to go now. I identified a number of other bugs that I have now fixed. Sorry for all the force pushes!
I tried training tacotron from your fork and got this: Traceback (most recent call last): …
Also I needed to add …
I'll take a look today/this weekend! Also, I tested WaveRNN myself and didn't seem to have any issues with it. What was the cause and effect of calling flatten_parameters?
(force-pushed from c6deada to 94302af)
In other news, I cannot for the life of me figure out why the flatten_parameters warning is still happening. We use …
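For context, this warning is usually triggered when nn.DataParallel replicates an RNN and the replicas' weight copies are no longer one contiguous chunk of memory; the common workaround is to re-flatten inside forward(). A rough sketch (the class and sizes are illustrative, not the fork's actual code, and the workaround does not always silence the warning):

```python
import torch.nn as nn

class RNNBlock(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.rnn = nn.GRU(dims, dims, batch_first=True)

    def forward(self, x):
        # Under nn.DataParallel each replica holds non-contiguous weight
        # copies, which is what produces the "weights are not part of a
        # single contiguous chunk" warning. Re-flatten before running the RNN.
        self.rnn.flatten_parameters()
        out, _ = self.rnn(x)
        return out
```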
+ Added type annotation for WaveRNN and Tacotron in train and generate files for each model
* Fixed missing import for numpy in `fatchord_version.py` and `deepmind_version.py`
- Removed get_r and set_r
* Moved ownership of the self.r buffer from Tacotron to Decoder
* Now using a property decorator in Tacotron for self.r
* Made all buffers reside on the same device as the parameters, to fix an issue where nn.parallel.gather expected a tensor to be on a particular GPU but it was on a different device (CPU) instead
* Updated gen_tacotron.py and train_tacotron.py to use the self.r property instead of getter and setter methods
* Made all values returned from forward() tensors, to ensure that nn.parallel.gather works properly
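The buffer/property arrangement described above looks roughly like this. This is a simplified illustration with an assumed default of r = 1, not the exact code from the fork:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # r lives on the Decoder as a buffer, so it is saved/loaded with the
        # state dict and moved to the same device as the parameters.
        self.register_buffer('r', torch.tensor(1, dtype=torch.long))

class Tacotron(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = Decoder()

    @property
    def r(self):
        # Expose the decoder's buffer as a plain int.
        return int(self.decoder.r.item())

    @r.setter
    def r(self, value):
        # Assigning a tensor to an existing buffer name replaces the buffer
        # while keeping its dtype and device.
        self.decoder.r = self.decoder.r.new_tensor(value)
```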
OK @fatchord I think I got the bugs sorted out - I've tested the code on CPU, single GPU, and multi-GPU. There is one issue, however: what should the default value of self.r be? Is it 1? Basically this is an issue when training from scratch without a checkpoint file - when that happens, the model just uses a value of 0 for r and then breaks. Whether 1 can be used safely as a default may affect whether the pretrained models need to be updated. Let me know; for now I will make 1 the default. If the model was trained with r > 1, can r=1 be used for inference? I noticed that there is a … I think figuring out what to do about r is the only thing remaining for getting this totally ready for a merge.
Yeah, I think we're good to go here. Many thanks for your awesome work - I really appreciate it.
@fatchord I think the branch you merged didn't contain my latest commit. Can you undo the merge? I had a couple of minor fixes that weren't incorporated.
OK, I've identified some new bugs that break this patch, so unfortunately your master branch is broken right now. @fatchord I suggest reverting the patch until these are fixed. Bugs that need to be fixed (will keep this list updated):
Breaking: …
Non-breaking: …
Potential for bugs (needs testing/assistance from @fatchord): …
Status update: just did a massive refactoring of the paths system, so that everything uses pathlib. Also began the process of refactoring the model checkpointing code. This is part of the effort to fix the "resuming from training" bug and the potential bugs dealing with model checkpoints, but it is also meant to make the code cleaner, less brittle, and more likely to work on Windows. Again, as a reminder, @fatchord I suggest unmerging the patch until it's ready, as both your repo and mine are in a broken state right now.
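As an illustration of what pathlib-based checkpoint handling might look like (the directory layout and filenames here are assumptions, not the fork's actual code):

```python
from pathlib import Path
import torch

def checkpoint_paths(checkpoint_dir, name='latest'):
    # pathlib handles separators portably, so the same code works on
    # Linux, macOS, and Windows.
    ckpt_dir = Path(checkpoint_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    return ckpt_dir / f'{name}_weights.pyt', ckpt_dir / f'{name}_optim.pyt'

def save_checkpoint(model, optimizer, checkpoint_dir):
    weights_path, optim_path = checkpoint_paths(checkpoint_dir)
    torch.save(model.state_dict(), weights_path)
    torch.save(optimizer.state_dict(), optim_path)
```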
@TheButlah I've reverted back - let me know what I can do to get this merged.
@fatchord if you can add some code in a new branch on my fork to handle backwards compatibility when loading the checkpoints (make sure r and step are loaded), that would be the most helpful :)
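One possible shape for that backwards-compatibility shim (the helper and the buffer key names are assumptions; older checkpoints may store r and step differently or not at all):

```python
import torch

def load_checkpoint_compat(model, checkpoint_path):
    # Older checkpoints were saved before `step` and `r` became buffers,
    # so those keys may be missing from the saved state dict.
    state = torch.load(checkpoint_path, map_location='cpu')
    current = model.state_dict()
    for key in ('step', 'decoder.r'):  # assumed buffer names
        if key in current and key not in state:
            # Fall back to the freshly initialised default value.
            state[key] = current[key]
    model.load_state_dict(state)
```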
@TheButlah No problem - I'll get around to it as soon as I get the time.
@TheButlah Apologies, I've been super-busy the last week. I'll make time at the weekend to try and get your fork merged.
No problem. As best as I can tell, Tacotron seems to be working properly. I haven't really worked on WaveRNN, which is the primary reason I've not submitted a new PR yet.
To clarify, other than backwards compatibility with previously saved models, I think I'm pretty happy with the state of the Tacotron model on my fork. It properly does multi-GPU, guarantees that resuming from checkpoints doesn't break training, and uses pathlib. I haven't tested the generation scripts yet though. I'm in a bit of a crunch at my job right now, so I probably don't have time to work on WaveRNN, but it ought to be fairly easy to finish up the patch so that everything else works, if @fatchord or another contributor can pick it up.
@TheButlah That sounds great. I think the best thing is to train new models, since I want to try predicting more mel channels (128 or 256). Once they are trained I will test the things you mention, add new pretrained weights, and merge the whole thing. Should be done in a week.
Quick update: my fork now appears to have WaveRNN, Tacotron, and the respective inference scripts working. Training works on CPU, single GPU, and multi-GPU. I'll open a new PR. Please vet the code before merging though - it's not thoroughly tested.