Accelerate version not working properly #63
Did you run `accelerate config`?
Yes, I did and I chose all the default options.
I see you have a problem with NaNs. Try choosing float32 for training. Do you have the same problem with the standard train.py script?
The same training works fine without accelerate (train.py). How would I enable float32? Do you mean in accelerate config? I am training with the mdx23c model.
I tried it again with fp32 ("use_amp" set to false), but I still get NaNs after a while. I also tried htdemucs instead of mdx23c and it made no difference. When I run the same training without accelerate, it runs fine for hours. See my output below: after a couple of steps I get NaN for the loss, and then the validation returns all zeros. I would really appreciate any insight.

/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: ...
0it [00:00, ?it/s]
And here is my command line: !accelerate launch Music-Source-Separation-Training/train_accelerate.py
Any ideas? When it runs, it is much faster, so it would be great if this could work properly.
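For anyone comparing, here is a minimal sketch of what pure-float32 training under accelerate looks like (equivalent in spirit to setting use_amp to false in the config), with a guard that reports the first step where the loss stops being finite. The tiny model, dummy data, and L1 loss below are stand-ins, not code from train_accelerate.py.

```python
# Sketch only: pure-float32 training under accelerate with a non-finite-loss guard.
# The model, data and loss are placeholders, not the repo's code.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="no")  # "no" keeps everything in float32

model = torch.nn.Conv1d(2, 2, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 2, 4096), torch.randn(64, 2, 4096))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (mixture, target) in enumerate(loader):
    estimate = model(mixture)
    loss = torch.nn.functional.l1_loss(estimate, target)

    # Report the first step where the loss stops being finite instead of
    # letting NaNs silently propagate into the epoch average.
    if not torch.isfinite(loss):
        accelerator.print(f"non-finite loss at step {step}")
        optimizer.zero_grad()
        continue

    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```

If the loss is already NaN at this check, the problem is in the forward pass or the data rather than in gradient synchronization between processes.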
Sorry, I have problems with this script myself, but with validation rather than training. I haven't had time to fix it yet. I will try next week.
Thanks. I can confirm the problem happens with many different models.
I did some fixes. The main issue was probably something I had lost.

UPD: I tested a bit. Looks like it works fine now.
Thanks, looks good so far! I had a general question: when training a model with lots of stems, I notice two things: the training periodically pauses for a few seconds, and GPU utilization never stays at 100%.
For the pausing, increasing the number of data workers helps, but doesn't completely solve the problem. And for the GPU usage, it would be great if there was a way to keep the GPUs at 100% all the time. So my question is: what causes this lack of GPU efficiency? Is it an SSD speed issue or a processor issue? I'm training with six 3090 GPUs with P2P enabled, so the GPU-to-GPU speed is 50 GB/s bidirectional, and I'm using a RAID 0 array that does 11 GB/s. Would improving the CPU or SSD speed help with this? Thanks!
I just upgraded my SSD but didn't see much improvement. I would love to get your insight on where the inefficiencies are occurring in these stem separation models.
Some augmentations can also cause slowdowns during training when enabled (in particular pitch-shifting, time-stretching and mp3 encoding), and at least some of them, if not all, run on the CPU. If you disable them all, is it significantly faster?
Thanks for the help. I tried disabling all augmentations, and it didn't make much difference, so my CPU must be fast enough to keep up. However, I noticed something interesting: this issue only happens when I use more than one GPU. If I train with a single GPU, utilization is nearly 100% all the time. As soon as I add a second GPU, utilization goes down, and by the time I add six GPUs it's about 50% on average (swinging from 0% to 100% periodically). What could be going on? I checked the IO load with iotop and all the worker threads are using about 1-2% IO. I also upgraded my SSD RAID 0 array and now get 23 GB/s, so I don't think that's the bottleneck. The number of workers is currently 24 (4 per GPU); I tried more and it didn't make any difference. I am testing with the MDX23C model.
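One way to narrow this down is to time how long each step spends blocked on the DataLoader versus how long it spends in forward/backward. A rough sketch, assuming batches are (mixture, target) pairs; none of these names come from the repo:

```python
# Sketch: split per-step wall time into "waiting for data" vs "forward/backward".
import time
import torch

def profile_step_breakdown(loader, model, optimizer, device, num_steps=50):
    data_wait = compute = 0.0
    t_prev = time.perf_counter()
    for _, (mixture, target) in zip(range(num_steps), loader):
        t_fetch = time.perf_counter()
        data_wait += t_fetch - t_prev              # time blocked on the DataLoader
        mixture, target = mixture.to(device), target.to(device)
        loss = torch.nn.functional.l1_loss(model(mixture), target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if device.type == "cuda":
            torch.cuda.synchronize(device)         # make GPU time visible to the timer
        t_prev = time.perf_counter()
        compute += t_prev - t_fetch
    print(f"data wait: {data_wait:.1f}s, compute: {compute:.1f}s over {num_steps} steps")
```

If the data-wait share grows with the number of GPUs, the input pipeline (augmentations, decoding, disk reads) is the likely culprit: with distributed training every process runs its own workers, so the aggregate CPU load scales with the GPU count even when a single pipeline can keep one GPU fed.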
Try reducing the workers. Make it lower, like 2 or 4.
When I try this, it only processes 1 or 2 steps, then pauses for a few seconds, then does another 1 or 2, then pauses again. To avoid the pauses I need to increase the workers to at least 16.
Did it happen with both versions of the train script (train.py and train_accelerate.py)?
Yes, it's the same thing on both, but the accelerate version is a little faster.
Looks like the problem was the augmentations after all, particularly pitch and distortion; I didn't realize I had them turned on. Now I see a ~20% improvement when running with accelerate and there are no more pauses. I'm also able to get my data workers down to 2 with no problem. Thanks for the help!
I do see one problem with the accelerate version, though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the LR every time (even after the very first epoch). Might it be the way the SDR is being averaged across all processes?
I couldn't fix this issue yet. It's a problem with the scheduler.
It must be that the scheduler is being called multiple times per epoch (once for every GPU?). That's the only way I can think of that the LR would get decreased even after one epoch...
Yes, but when I call it only once, on the main thread, the LR becomes different on different GPUs... I need to understand the problem.
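One common pattern for keeping the LR in sync (a sketch, assuming a ReduceLROnPlateau-style scheduler keyed on validation SDR; the helper name here is made up) is to gather the metric from every process, average it, and let every rank step its own scheduler with that same value, so the patience counters and LR decisions stay identical everywhere:

```python
# Sketch: step a ReduceLROnPlateau-style scheduler consistently across ranks.
import torch

def step_plateau_scheduler(accelerator, scheduler, local_sdr):
    # Average the per-process SDR so every rank sees the same number...
    sdr = torch.tensor([float(local_sdr)], device=accelerator.device)
    global_sdr = accelerator.gather(sdr).mean().item()
    # ...then step on every rank, once per epoch, with that shared value.
    scheduler.step(global_sdr)
    return global_sdr
```

Stepping the scheduler only on the main process leaves the other ranks on the old LR (which matches the per-GPU differences described above), while stepping on every rank with its own per-rank metric lets the patience counters disagree.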
In case you are using audiomentations 0.24.0 for data augmentation and you are observing bottleneck issues: I have improved the speed in audiomentations 0.27.0, 0.31.0, 0.34.1, 0.36.0, 0.36.1 and 0.37.0 (see the changelog). Upgrading may help a little bit.
If it's pedalboard's distortion that was slow, I would recommend removing that augmentation entirely, as it also creates huge gain changes, while audiomentations has a better, tanh-based alternative that is gain-balanced, more musical-sounding, and faster.
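For reference, a minimal sketch of that swap using audiomentations' TanhDistortion; the parameter values below are placeholders, not tuned recommendations:

```python
# Sketch: gain-balanced tanh distortion as a stand-in for pedalboard's Distortion.
import numpy as np
from audiomentations import Compose, TanhDistortion

augment = Compose([
    TanhDistortion(min_distortion=0.1, max_distortion=0.7, p=0.5),  # placeholder values
])

sample_rate = 44100
audio = np.random.uniform(-0.5, 0.5, size=sample_rate * 2).astype(np.float32)  # dummy waveform
distorted = augment(samples=audio, sample_rate=sample_rate)
```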
I tried training a model using:
accelerate launch train_accelerate.py
I get this output:
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  _torch_pytree._register_pytree_node(
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `6`
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.

It continues to train after the warning, but the loss value is always 'nan' and the validation results in 0.0 dB for all stems.