Enabled multi-gpu training, buffers, grad clip in vocoder, saving optimizer state, and more fixes #109
Conversation
Let me know if you have any questions. I am less familiar with DataParallel, but it seems to be working on my multi-GPU setup. Note that I did not enable multi-GPU for the …
Also, let me know if the whole …
@fatchord let me know if there is anything you need to help get this merged!
@TheButlah Apologies for the delay! I hope to get around to testing this at the weekend. And thanks btw - this looks good.
Sorry, one thing - would it be possible to set the GPU ids in hparams.py, or perhaps through argparse using that workaround?
@fatchord I considered that, but I think the best way to do that is using the …
+ Added `voc_clip_grad_norm` hparam
+ Enabled gradient clipping in vocoder
+ Added warning when gradient is NaN in vocoder and synthesizer
+ Added workaround for broken nn.DataParallel in `utils.__init__.py`
* Refactored `tacotron.py` and `fatchord_version.py` to store step in a PyTorch buffer instead of a gradient-less parameter
* Refactored training code for WaveRNN and Tacotron to automagically use data parallelism in the presence of multiple GPUs
! Note that your batch size must be divisible by the number of GPUs
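For readers less familiar with these PyTorch pieces, here is a minimal, hypothetical sketch of the ideas above. The stub class and the helper `data_parallel_if_possible` are illustrative names, not the PR's actual code; only `voc_clip_grad_norm` comes from the change list.

```python
import torch
import torch.nn as nn

class WaveRNNStub(nn.Module):
    """Toy stand-in showing the buffer-based step counter."""
    def __init__(self):
        super().__init__()
        # A buffer is saved in the state dict and moved by .to(device),
        # but is not returned by model.parameters(), so the optimizer
        # never touches it (unlike a gradient-less nn.Parameter).
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

    def get_step(self):
        return int(self.step.item())

def data_parallel_if_possible(model):
    # Wrap in nn.DataParallel only when more than one GPU is visible;
    # otherwise return the model unchanged.
    if torch.cuda.device_count() > 1:
        return nn.DataParallel(model)
    return model

# Inside the vocoder training loop, gradient clipping with the new hparam
# might look roughly like this (clip_grad_norm_ returns the total norm,
# which can be checked for NaN to emit the warning):
#   loss.backward()
#   grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hp.voc_clip_grad_norm)
#   if torch.isnan(torch.as_tensor(grad_norm)):
#       print('Warning: grad_norm was NaN!')
#   optimizer.step()
```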
Update: I fixed a bug in my patch where I forgot to import one of the functions I made. More importantly though, I made the code save the optimizer state along with the latest weights in a separate file. This will ensure that when resuming training, the optimizer's momentum and other stateful properties are restored, ensuring that you can start and stop the model without affecting the properties of the training process. See #87 (comment)
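As a rough illustration of the optimizer-state change (the filenames here are assumptions, not necessarily the ones used in the PR):

```python
import torch

def save_checkpoint(model, optimizer,
                    model_path='latest_weights.pyt',
                    optim_path='latest_optim.pyt'):
    # Weights and optimizer state are written to separate files.
    torch.save(model.state_dict(), model_path)
    torch.save(optimizer.state_dict(), optim_path)

def restore_checkpoint(model, optimizer,
                       model_path='latest_weights.pyt',
                       optim_path='latest_optim.pyt'):
    model.load_state_dict(torch.load(model_path))
    # Restoring the optimizer brings back momentum / Adam moments, so a
    # stop-and-resume run behaves like an uninterrupted one.
    optimizer.load_state_dict(torch.load(optim_path))
```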
(force-pushed from 6422d25 to 45c6240)
(force-pushed from 2f76935 to bb4c66e)
* Made model.load() always map devices to the device of the model's parameters instead of always to CPU.
* Fixed a bug where checking whether a tensor was using CUDA was done by comparing it to the cuda device. This is not the correct way of doing things; not sure how it even worked before.
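A small sketch of what those two fixes amount to in PyTorch terms (the function names here are illustrative, not the fork's actual code):

```python
import torch

def load_weights(model, path):
    # Map the checkpoint onto whatever device the model already lives on,
    # rather than always onto the CPU.
    device = next(model.parameters()).device
    model.load_state_dict(torch.load(path, map_location=device))

def on_gpu(t):
    # Check device placement via .is_cuda (or t.device.type == 'cuda'),
    # not by comparing the tensor to a torch.device object.
    return t.is_cuda
```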
OK @fatchord, this should be fixed and ready to go now. I identified a number of other bugs that I have now fixed. Sorry for all the force pushes!
I tried training tacotron from your fork and got this: Traceback (most recent call last): …
Also I needed to add …
I'll take a look today/this weekend! Also, I tested WaveRNN myself and didn't seem to have any issues with it. What was the cause and effect of calling flatten_parameters?
(force-pushed from c6deada to 94302af)
In other news, I cannot for the life of me figure out why the flatten_parameters warning is still happening. We use …
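For context, this warning is usually triggered when nn.DataParallel replicates an RNN and the replicas' weight copies are no longer one contiguous chunk of memory; the common workaround is to re-flatten inside forward(). A rough sketch (the class and sizes are illustrative, not the fork's actual code, and the workaround does not always silence the warning):

```python
import torch.nn as nn

class RNNBlock(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.rnn = nn.GRU(dims, dims, batch_first=True)

    def forward(self, x):
        # Under nn.DataParallel each replica holds non-contiguous weight
        # copies, which is what produces the "weights are not part of a
        # single contiguous chunk" warning. Re-flatten before running the RNN.
        self.rnn.flatten_parameters()
        out, _ = self.rnn(x)
        return out
```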
+ Added type annotation for WaveRNN and Tacotron in train and generate files for each model
* Fixed missing import for numpy in `fatchord_version.py` and `deepmind_version.py`
- Removed get_r and set_r
* Moved ownership of the self.r buffer from Tacotron to Decoder
* Now using a property decorator in Tacotron for self.r
* Made all buffers reside on the same device as the parameters, to fix an issue where nn.parallel.gather expected a tensor to be on a particular GPU but it was on a different device (CPU) instead
* Updated gen_tacotron.py and train_tacotron.py to use the self.r property instead of getter and setter methods
* Made all values returned from forward() tensors, to ensure that nn.parallel.gather works properly
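The buffer/property arrangement described above looks roughly like this. This is a simplified illustration with an assumed default of r = 1, not the exact code from the fork:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # r lives on the Decoder as a buffer, so it is saved/loaded with the
        # state dict and moved to the same device as the parameters.
        self.register_buffer('r', torch.tensor(1, dtype=torch.long))

class Tacotron(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = Decoder()

    @property
    def r(self):
        # Expose the decoder's buffer as a plain int.
        return int(self.decoder.r.item())

    @r.setter
    def r(self, value):
        # Assigning a tensor to an existing buffer name replaces the buffer
        # while keeping its dtype and device.
        self.decoder.r = self.decoder.r.new_tensor(value)
```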
OK @fatchord I think I got the bugs sorted out - I've tested the code on CPU, single GPU, and multi-GPU. There is one issue, however: what should the default value of self.r be? Is it 1? Basically this is an issue when training from scratch without a checkpoint file - when that happens, the model just uses a value of 0 for r and then breaks. Whether 1 can be used safely as a default may affect whether the pretrained models need to be updated. Let me know; for now I will make 1 the default. If the model was trained with r > 1, can r=1 be used for inference? I noticed that there is a … I think figuring out what to do about r is the only thing remaining for getting this totally ready for a merge.
Yeah, I think we're good to go here. Many thanks for your awesome work - I really appreciate it.
@fatchord I think the branch you merged didn't contain my latest commit. Can you undo the merge? I had a couple of minor fixes that weren't incorporated.
OK, I've identified some new bugs that break this patch, so unfortunately your master branch is broken right now. @fatchord I suggest reverting the patch until these are fixed. Bugs that need to be fixed (will keep this list updated):
Breaking: …
Non-breaking: …
Potential for bugs (needs testing/assistance from @fatchord): …
Status update: just did a massive refactoring of the paths system, so that everything uses pathlib. Also began the process of refactoring the model checkpointing code. This is part of the effort to fix the "resuming from training" bug and the potential bugs dealing with model checkpoints, but it is also meant to make the code cleaner, less brittle, and more likely to work on Windows. Again, as a reminder, @fatchord I suggest unmerging the patch until it's ready, as both your repo and mine are in a broken state right now.
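As an illustration of what pathlib-based checkpoint handling might look like (the directory layout and filenames here are assumptions, not the fork's actual code):

```python
from pathlib import Path
import torch

def checkpoint_paths(checkpoint_dir, name='latest'):
    # pathlib handles separators portably, so the same code works on
    # Linux, macOS, and Windows.
    ckpt_dir = Path(checkpoint_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    return ckpt_dir / f'{name}_weights.pyt', ckpt_dir / f'{name}_optim.pyt'

def save_checkpoint(model, optimizer, checkpoint_dir):
    weights_path, optim_path = checkpoint_paths(checkpoint_dir)
    torch.save(model.state_dict(), weights_path)
    torch.save(optimizer.state_dict(), optim_path)
```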
@TheButlah I've reverted back - let me know what I can do to get this merged.
@fatchord if you can add some code in a new branch on my fork to handle backwards compatibility when loading the checkpoints (make sure r and step are loaded), that would be the most helpful :)
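One possible shape for that backwards-compatibility shim (the helper and the buffer key names are assumptions; older checkpoints may store r and step differently or not at all):

```python
import torch

def load_checkpoint_compat(model, checkpoint_path):
    # Older checkpoints were saved before `step` and `r` became buffers,
    # so those keys may be missing from the saved state dict.
    state = torch.load(checkpoint_path, map_location='cpu')
    current = model.state_dict()
    for key in ('step', 'decoder.r'):  # assumed buffer names
        if key in current and key not in state:
            # Fall back to the freshly initialised default value.
            state[key] = current[key]
    model.load_state_dict(state)
```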
@TheButlah No problem - I'll get around to it as soon as I get the time.
@TheButlah Apologies, I've been super-busy the last week. I'll make time at the weekend to try and get your fork merged.
No problem. As best as I can tell, Tacotron seems to be working properly. I haven't really worked on WaveRNN, which is the primary reason I've not submitted a new PR yet.
To clarify, other than backwards compatibility with previously saved models, I think I'm pretty happy with the state of the Tacotron model on my fork. It properly does multi-GPU, guarantees that resuming from checkpoints doesn't break training, and uses pathlib. I haven't tested the generation scripts yet though. I'm in a bit of a crunch at my job right now, so I probably don't have time to work on WaveRNN, but it ought to be fairly easy to finish up the patch so that everything else works, if @fatchord or another contributor can pick it up.
@TheButlah That sounds great. I think the best thing is to train new models, since I want to try predicting more mel channels (128 or 256). Once they are trained I will test the things you mention, add new pretrained weights, and merge the whole thing. Should be done in a week.
Quick update: my fork now appears to have WaveRNN, Tacotron, and the respective inference scripts working. Training works on CPU, single GPU, and multi-GPU. I'll open a new PR. Please vet the code before merging though - it's not thoroughly tested.