
Conversation

TheButlah
Contributor

  • Added voc_clip_grad_norm hparam
  • Enabled gradient clipping in vocoder (sketched below)
  • Added warning when the gradient is NaN in vocoder and synthesizer
  • Added workaround for broken nn.DataParallel in utils.__init__.py
  • Refactored tacotron.py and fatchord_version.py to store step
    in a PyTorch buffer instead of a gradient-less parameter
  • Refactored training code for WaveRNN and Tacotron to automagically
    use data parallelism when multiple GPUs are present.
    ! Note that your batch size must be divisible by the number of GPUs
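
For concreteness, here is a minimal sketch of what the vocoder-side gradient clipping and NaN warning could look like. The `voc_clip_grad_norm` name comes from this PR, but the toy model, optimizer, and warning text below are placeholders, not the code in the training scripts:

```python
import math

import torch
from torch import nn

# Toy stand-ins so the fragment runs on its own; the real loop lives in the
# vocoder/synthesizer training scripts and differs in detail.
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
voc_clip_grad_norm = 4.0  # example value for the new hparam; the actual default may differ

x, y = torch.randn(16, 8), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
if voc_clip_grad_norm is not None:
    # Rescales gradients in place so their total norm is at most voc_clip_grad_norm,
    # and returns the pre-clipping norm, which makes a cheap NaN check possible.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), voc_clip_grad_norm)
    if math.isnan(grad_norm):
        print('WARNING: gradient norm is NaN; this update may corrupt the weights')
optimizer.step()
```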

@TheButlah
Contributor Author

Let me know if you have any questions. I am less familiar with DataParallel, but it seems to be working on my multi-GPU setup. Note that I did not enable multi-GPU support for the generate() methods, as adding the code there seemed non-trivial. Also, as before, I have not messed with the notebooks at all.

@TheButlah
Contributor Author

Also, let me know if the whole data_parallel_workaround() technique is no longer needed. It would be great if we could just use the normal nn.DataParallel instead.
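
For context, a minimal sketch of what the plain wrapper would look like if the workaround turns out to be unnecessary; the `nn.Linear` is just a placeholder for the real Tacotron/WaveRNN module:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder for the real Tacotron / WaveRNN module
if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        # Each forward() call splits the batch along dim 0 across the visible GPUs
        # and gathers the outputs back onto device 0.
        model = nn.DataParallel(model)

out = model(torch.randn(32, 8).to(next(model.parameters()).device))
# Caveat: once wrapped, custom attributes/methods live behind model.module,
# which is one reason a thin workaround function can be more convenient.
```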

@TheButlah
Contributor Author

@fatchord let me know if there is anything you need to help get this merged!

@fatchord
Owner

@TheButlah Apologies for the delay! I hope to get around to testing this at the weekend.

And thanks btw - this looks good.

@fatchord
Owner

Sorry, one thing - would it be possible to set the GPU ids in hparams.py, or perhaps through argparse, using that workaround?

@TheButlah
Contributor Author

@fatchord I considered that, but I think the best way to do that is with the CUDA_VISIBLE_DEVICES environment variable. In general, most people want to train on all available GPUs, and having to explicitly specify which GPUs are used as a list seems like unnecessary configuration for most situations. Anyone needing that level of control should know about CUDA_VISIBLE_DEVICES anyway.
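
For anyone who hasn't used it, CUDA_VISIBLE_DEVICES is a standard CUDA mechanism, not something added by this PR; a quick illustration:

```python
# Typical usage is on the command line, restricting training to GPUs 0 and 2:
#
#   CUDA_VISIBLE_DEVICES=0,2 python train_tacotron.py
#
# It can also be set from Python, as long as it happens before CUDA is initialized:
import os
os.environ.setdefault('CUDA_VISIBLE_DEVICES', '0,2')  # example device ids

import torch

# Only the listed GPUs are visible (renumbered from 0), so DataParallel and
# torch.cuda.device_count() pick up the restricted set automatically.
print(torch.cuda.device_count())
```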

+ Added `voc_clip_grad_norm` hparam
+ Enabled gradient clipping in vocoder
+ Added warning when the gradient is NaN in vocoder and synthesizer
+ Added workaround for broken nn.DataParallel in `utils.__init__.py`
* Refactored `tacotron.py` and `fatchord_version.py` to store step
  in a PyTorch buffer instead of a gradient-less parameter
* Refactored training code for WaveRNN and Tacotron to automagically
  use data parallelism when multiple GPUs are present.
! Note that your batch size must be divisible by the number of GPUs
@TheButlah TheButlah changed the title Enabled multi-gpu training, buffers, grad clip in vocoder Enabled multi-gpu training, buffers, grad clip in vocoder, and more fixes Jul 18, 2019
@TheButlah
Contributor Author

TheButlah commented Jul 18, 2019

Update: I fixed a bug in my patch where I forgot to import one of the functions I made. More importantly, I made the code save the optimizer state in a separate file alongside the latest weights. This ensures that when resuming training, the optimizer's momentum and other stateful properties are restored, so you can stop and restart training without affecting the training process. See #87 (comment)
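
A minimal sketch of the idea (the file names here are illustrative, not necessarily the ones used in the patch):

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # toy stand-ins for the real model and optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Save the weights and the optimizer state as two files side by side.
torch.save(model.state_dict(), 'latest_weights.pyt')
torch.save(optimizer.state_dict(), 'latest_optim.pyt')

# On resume, restore both so Adam's momentum/variance estimates carry over and
# stopping/restarting training is equivalent to training straight through.
model.load_state_dict(torch.load('latest_weights.pyt'))
optimizer.load_state_dict(torch.load('latest_optim.pyt'))
```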

@TheButlah TheButlah changed the title Enabled multi-gpu training, buffers, grad clip in vocoder, and more fixes Enabled multi-gpu training, buffers, grad clip in vocoder, saving optimizer state, and more fixes Jul 18, 2019
@TheButlah TheButlah force-pushed the master branch 5 times, most recently from 6422d25 to 45c6240 on July 18, 2019 at 23:31
@TheButlah TheButlah force-pushed the master branch 3 times, most recently from 2f76935 to bb4c66e on July 19, 2019 at 00:15
* Made model.load() always map devices to the device of the model's
  parameters instead of always to CPU.
* Fixed a bug where checking whether a tensor was on CUDA was done by
  comparing against the CUDA device object. This is not the correct
  way of doing it; not sure how it even worked before (see sketch below).
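
A rough illustration of both points; the module and checkpoint file are placeholders, not the exact code in this PR:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # placeholder module
torch.save(model.state_dict(), 'latest_weights.pyt')  # pretend checkpoint

# Map the checkpoint onto whatever device the model's parameters already live on,
# instead of forcing everything to the CPU.
device = next(model.parameters()).device
model.load_state_dict(torch.load('latest_weights.pyt', map_location=device))

# Checking for CUDA: compare the device *type* or use .is_cuda, rather than
# comparing against torch.device('cuda'), which carries no index and therefore
# does not compare equal to e.g. torch.device('cuda:1').
t = next(model.parameters())
print(t.is_cuda, t.device.type == 'cuda')
```
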
@TheButlah
Contributor Author

TheButlah commented Jul 19, 2019

OK @fatchord This should be fixed and ready to go now. I identified a number of other bugs that I have now fixed. Sorry for all the force pushes!

@fatchord
Owner

I tried training tacotron from your fork and got this:

Traceback (most recent call last):
  File "train_tacotron.py", line 180, in <module>
    tts_train_loop(model, optimizer, train_set, lr, training_steps, attn_example)
  File "train_tacotron.py", line 39, in tts_train_loop
    m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m)
  File "/code/ollie/test_multigpu/WaveRNN/utils/__init__.py", line 21, in data_parallel_workaround
    replicas = torch.nn.parallel.replicate(model, device_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 112, in replicate
    buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 76, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

@fatchord
Owner

Also I needed to add
self.rnn1.flatten_parameters()
self.rnn2.flatten_parameters()
to fatchord_version.py to get WaveRNN to work (it worked perfectly after that).
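
For reference, a self-contained sketch of where those calls go; the module below is a hypothetical stand-in, not the actual WaveRNN class:

```python
import torch
from torch import nn

class TinyRNNStack(nn.Module):
    """Hypothetical stand-in for the WaveRNN RNN stack, only to place the calls."""
    def __init__(self, dims=32):
        super().__init__()
        self.rnn1 = nn.GRU(dims, dims, batch_first=True)
        self.rnn2 = nn.GRU(dims, dims, batch_first=True)

    def forward(self, x):
        # After replication by nn.DataParallel the RNN weights may no longer sit in
        # one contiguous chunk of memory, which triggers a UserWarning and can slow
        # cuDNN; re-flattening at the top of forward() compacts them (no-op on CPU).
        self.rnn1.flatten_parameters()
        self.rnn2.flatten_parameters()
        x, _ = self.rnn1(x)
        x, _ = self.rnn2(x)
        return x

out = TinyRNNStack()(torch.randn(2, 10, 32))
print(out.shape)  # torch.Size([2, 10, 32])
```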

@TheButlah
Contributor Author

I'll take a look today/this weekend! Also, I tested WaveRNN myself and didn't seem to have any issues with it; what was the cause and effect of adding flatten_parameters()?

@TheButlah TheButlah force-pushed the master branch 2 times, most recently from c6deada to 94302af on July 22, 2019 at 23:28
@TheButlah
Contributor Author

In other news, I cannot for the life of me figure out why the flatten-parameters warning is still happening. We use GRUCell and LSTMCell, but those don't extend RNNBase and hence don't have the flatten_parameters() function. Any idea why PyTorch is complaining, @fatchord? I'm going to move on and fix the errors you mentioned in Tacotron and forget about the parameter flattening for now.

+ Added type annotation for WaveRNN and Tacotron in train and
  generate files for each model
* Fixed missing import for numpy in `fatchord_version.py` and
  `deepmind_version.py`
- Removed get_r and set_r
* Moved ownership of self.r buffer from Tacotron to Decoder
* Now using property decorator in Tacotron for self.r
* Made all buffers reside on the same device as the parameters to fix an
  issue where nn.parallel.gather expected a tensor to be on a particular
  GPU but it was on a different device (CPU) instead (see sketch below).
* Updated gen_tacotron.py and train_tacotron.py to use self.r
  property instead of getter and setter methods
* Made all returned values from forward() tensors to ensure that
  nn.parallel.gather works properly
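
A minimal sketch of the buffer approach referenced above; this is a hypothetical skeleton, not the real Tacotron code:

```python
import torch
from torch import nn

class TinyTacotron(nn.Module):
    """Hypothetical skeleton, only to show buffer registration and device handling."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)
        # Buffers live in state_dict() and are moved by .to()/.cuda(), but they get
        # no gradient - a better fit for `step` than a Parameter with requires_grad=False.
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        if self.training:
            self.step += 1
        return self.linear(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TinyTacotron().to(device)  # the buffer follows the parameters onto `device`,
                                   # so replication sees everything on devices[0]
out = model(torch.randn(2, 8, device=device))
print(int(model.step))
```
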
@TheButlah
Contributor Author

TheButlah commented Jul 23, 2019

OK @fatchord I think I got the bugs sorted out - I've tested the code on CPU, single GPU, and multi-GPU.

There is one issue however. What should the default value of self.r be? Is it 1? Basically this is an issue when training from scratch without a checkpoint file - when this happens, the model just uses a value of 0 for r and then it breaks.

Whether 1 can safely be used as a default may affect whether the pretrained models need to be updated. Let me know; for now I will make 1 the default.

If the model was trained with r > 1, can r=1 be used for inference? I noticed that there is a tts_r hyperparameter in hparams.py, but it doesn't seem to be used, perhaps because it's now redundant with the training-schedule r.

I think figuring out what to do about r is the only thing remaining for getting this totally ready for a merge.
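
To make the r handling concrete, here is a rough sketch of the buffer-plus-property arrangement with 1 as the default; the class skeletons are hypothetical, not the real code:

```python
import torch
from torch import nn

class Decoder(nn.Module):
    """Hypothetical skeleton - just enough to show which module owns `r`."""
    def __init__(self):
        super().__init__()
        # Default to 1 (one mel frame per decoder step) so training from scratch
        # works; checkpoints trained with a schedule overwrite this on load.
        self.register_buffer('r', torch.tensor(1))

class Tacotron(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = Decoder()

    @property
    def r(self):
        return self.decoder.r.item()

    @r.setter
    def r(self, value):
        self.decoder.r.fill_(value)

model = Tacotron()
model.r = 7     # e.g. set from the training schedule
print(model.r)  # 7; the value travels with the checkpoint via state_dict()
```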

@fatchord
Owner

fatchord commented Jul 24, 2019

Yeah, I think we're good to go here. Many thanks for your awesome work - I really appreciate it.

@fatchord fatchord merged commit 98c553c into fatchord:master Jul 24, 2019
@TheButlah
Contributor Author

TheButlah commented Jul 24, 2019

@fatchord I think the branch you merged didn't contain my latest commit. Can you undo the merge and use the latest commits?

I had a couple of minor fixes that weren't incorporated. The current state of my master branch is correct, so that is safe to merge.

@TheButlah
Contributor Author

TheButlah commented Jul 24, 2019

OK, I've identified some new bugs that break this patch, so unfortunately your master branch is broken right now. @fatchord I suggest reverting the patch until it's fixed.

Bugs that need to be fixed (will keep this list updated):

Breaking:

  • The attention tensor is being accessed as a NumPy array, but it's a Tensor. This is because I had to make all values returned from forward() tensors, while the attention plot used to be just a NumPy array. Must find all places where the attention plot is used and ensure that it is converted to a NumPy array where necessary (see the sketch after this list).
  • Issue with default r value being 0 instead of 1. Fixed on my master but still needs to be merged into fatchord's master.

Non-breaking:

  • The preprocess metavar for --num_workers is named EXT instead of N
  • Resuming training from checkpoints doesn't guarantee reproducible results, because the optimizer state is not saved along with the model state in all situations. This means that in some situations, saving and resuming from checkpoint files is not equivalent to training straight through to completion. It would be nice to guarantee consistency; I will refactor model saving and loading to make this more explicit and provide that guarantee.
  • The saved model should be identical whether it was trained on GPU or CPU, but right now the saved model keeps the devices used during training. This doesn't break the code, since loading maps the devices to the ones selected for training, but it would be nice to have consistent behavior and identical saved models regardless of whether CUDA or the CPU is used.

Potential for bugs (needs testing/assistance from @fatchord):

  • Did changing the model to use buffers instead of parameters for step and r, as well as making the decoder own the r buffer, break the pretrained models? If so, those pretrained models need to be converted to be compatible with the new code. Additionally, it would be nice to be able to load those models in a backwards-compatible way (for pretrained models made by others).
  • Does making the r value default to 1 for inference break the code? I think it probably does; r should be loaded from the model checkpoint instead to avoid this issue.
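
Regarding the attention item above, the conversion itself is straightforward; a small illustration with placeholder names:

```python
import torch

# `attention` now comes back from forward() as a Tensor, possibly on the GPU and
# still attached to the autograd graph, so plotting/saving code needs an explicit
# conversion first.
attention = torch.rand(1, 50, 120, requires_grad=True)  # stand-in for a real attention map
attention_np = attention[0].detach().cpu().numpy()      # safe regardless of device/grad
print(attention_np.shape)  # (50, 120)
```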

@TheButlah
Contributor Author

Status update: I just did a massive refactoring of the paths system, so that everything uses pathlib. I also began refactoring the model checkpointing code. This is part of the effort to fix the resuming-training bug and the potential checkpoint bugs, but it also makes the code cleaner, less brittle, and more likely to work on Windows. Again, as a reminder, @fatchord I suggest unmerging the patch until it's ready, as both your repo and mine are in a broken state right now.
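
The flavour of the pathlib change, with illustrative names only (the fork's actual directory layout may differ):

```python
from pathlib import Path

checkpoints_dir = Path('checkpoints') / 'ljspeech.tacotron'  # example layout
checkpoints_dir.mkdir(parents=True, exist_ok=True)

weights_path = checkpoints_dir / 'latest_weights.pyt'
optim_path = checkpoints_dir / 'latest_optim.pyt'

# Path objects join with '/', behave correctly on Windows, and can be passed
# straight to torch.save/torch.load and open().
print(weights_path, weights_path.exists())
```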

@fatchord
Owner

@TheButlah I've reverted back - let me know what I can do to get this merged.

@TheButlah
Contributor Author

@fatchord if you can add some code in a new branch on my fork to handle backwards compatibility when loading the checkpoints (make sure r and step are loaded), that would be the most helpful :)
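
One possible shape for that compatibility shim, sketched under the assumption that older checkpoints may simply lack the new r/step entries; the key names and behaviour would need to be checked against the real pretrained files:

```python
import torch
from torch import nn

def load_checkpoint_compat(model: nn.Module, path, device='cpu'):
    """Load an older checkpoint into the refactored model, tolerating missing keys."""
    state = torch.load(path, map_location=device)
    model_state = model.state_dict()
    missing = [k for k in model_state if k not in state]
    obsolete = [k for k in state if k not in model_state]
    if missing:
        print(f'Keys missing from checkpoint, keeping defaults (e.g. r, step): {missing}')
    if obsolete:
        print(f'Ignoring keys the new model no longer has: {obsolete}')
    # strict=False keeps the model's current values (the defaults) for missing keys.
    model.load_state_dict(state, strict=False)
    return model

# Hypothetical usage:
# model = load_checkpoint_compat(Tacotron(...), 'pretrained/tts_weights.pyt')
```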

@fatchord
Owner

@TheButlah No problem - I'll get around to it as soon as I get the time.

@fatchord
Owner

fatchord commented Aug 2, 2019

@TheButlah Apologies, I've been super-busy the last week. I'll make time at the weekend to try and get your fork merged.

@TheButlah
Contributor Author

TheButlah commented Aug 2, 2019 via email

@TheButlah
Contributor Author

To clarify: other than backwards compatibility with previously saved models, I think I'm pretty happy with the state of the Tacotron model on my fork. It properly does multi-GPU, guarantees that resuming from checkpoints doesn't break training, and uses pathlib. I haven't tested the generation scripts yet though.

I'm at a bit of a crunch time at my job, so I probably don't have time to work on WaveRNN right now, but it ought to be fairly easy to finish up the patch so that everything else works, if either @fatchord or another contributor can do it.

@fatchord
Owner

fatchord commented Aug 5, 2019

@TheButlah That sounds great. I think the best thing is to train new models, since I want to try predicting more mel channels (128 or 256). Once they are trained I will test the things you mention, add new pretrained weights, and merge the whole thing. Should be done in a week.

@TheButlah
Contributor Author

TheButlah commented Aug 7, 2019

Quick update: my fork now appears to have WaveRNN, Tacotron, and their respective inference scripts working. Training works on CPU, single GPU, and multi-GPU. I'll open a new PR. Please vet the code before merging though - it's not thoroughly tested.

datitran referenced this pull request in spring-media/ForwardTacotron Mar 9, 2020
Enabled multi-gpu training, buffers, grad clip in vocoder, saving optimizer state, and more fixes
geneing pushed a commit to geneing/WaveRNN that referenced this pull request Mar 14, 2020
Enabled multi-gpu training, buffers, grad clip in vocoder, saving optimizer state, and more fixes