Accelerate version not working properly #63

Open
rimb05 opened this issue Aug 20, 2024 · 26 comments

Comments

rimb05 commented Aug 20, 2024

I tried training a model using:

accelerate launch train_accelerate.py

I get this output:
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 6
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.

It continues to train after the warning, but the loss value is always 'nan' and the validation results in 0.0dB for all stems.

ZFTurbo (Owner) commented Aug 20, 2024

Did you run accelerate config?

rimb05 (Author) commented Aug 20, 2024

Yes, I did and I chose all the default options.

ZFTurbo (Owner) commented Aug 20, 2024

I see you have some problem with NaNs. Try choosing float32 for training. Do you have the same problem with the standard train.py script?
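
For readers following the NaN discussion: this is roughly what the float32-vs-mixed-precision switch amounts to in a generic PyTorch training step. It is only a sketch (not the repo's actual code); the use_amp flag here mirrors the config option mentioned later in this thread.

import torch

use_amp = False  # False = plain float32 training, no autocast
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, optimizer, x, y):
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.l1_loss(model(x), y)
    scaler.scale(loss).backward()  # scaling is a no-op when use_amp is False
    scaler.step(optimizer)         # falls back to a plain optimizer.step() when disabled
    scaler.update()
    optimizer.zero_grad()
    return loss.item()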

rimb05 (Author) commented Aug 20, 2024

The same training works fine without accelerate (train.py). How would I enable float32? Do you mean in accelerate config? I am training with the mdx23c model.

rimb05 (Author) commented Aug 21, 2024

I tried it again with fp32 ("use_amp" set to false), but I still get NaNs after a while. I also tried htdemucs instead of mdx23c; it made no difference. When I run the same training without accelerate, it runs fine for hours. See my output below. You can see that after a couple of epochs I get NaNs for the loss, and then the validation returns all zeros. I would really appreciate any insight.

/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 6
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Instruments: ['vocal', 'drums', 'guitar', 'bass', 'piano', 'synth']
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.

0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Use augmentation for training
Dataset type: 1 Processes to use: 64
Collecting metadata for ['training_output']
Found metadata cache file: results3/metadata_1.pkl
Old metadata was used for 24603 tracks.
0it [00:00, ?it/s]
Found tracks in dataset: 24603
Processes GPU: 6
Patience: 2 Reduce factor: 0.95 Batch size: 4 Grad accum steps: 1 Effective batch size: 4 Optimizer: adam
100%|█| 10/10 [00:24<00:00, 2.41s/it, sdr_vocal=-0.0673, sdr_drums=-0.0126, sdr_
Valid length: 59
Instr SDR vocal: -0.0731 Debug: 60
Instr SDR vocal: -0.0733 Debug: 60
Valid length: 59
Instr SDR drums: -0.0027 Debug: 60
Instr SDR drums: -0.0020 Debug: 60
Valid length: 59
Instr SDR guitar: -6.5638 Debug: 60
Instr SDR guitar: -6.6667 Debug: 60
Valid length: 59
Instr SDR bass: -3.7258 Debug: 60
Instr SDR bass: -3.7697 Debug: 60
Valid length: 59
Instr SDR piano: -8.7960 Debug: 60
Instr SDR piano: -8.7994 Debug: 60
Valid length: 59
Instr SDR synth: -2.4083 Debug: 60
Instr SDR synth: -2.4107 Debug: 60
SDR Avg: -3.6203
Train for: 1000
Train epoch: 0 Learning rate: 9e-05
100%|████████| 1000/1000 [10:16<00:00, 1.62it/s, loss=0.0779, avg_loss=6.87e+3]
Training loss: 68.670630
100%|█| 10/10 [00:23<00:00, 2.33s/it, sdr_vocal=2.84, sdr_drums=1.79, sdr_guitar=-1.
Instr SDR vocal: 2.5106 Debug: 60
Instr SDR drums: 1.3487 Debug: 60
Instr SDR guitar: -7.0081 Debug: 60
Instr SDR bass: -5.7739 Debug: 60
Instr SDR piano: -6.0657 Debug: 60
Instr SDR synth: -2.3428 Debug: 60
SDR Avg: -2.8885
Store weights: results3/model_htdemucs_ep_0_sdr_-2.8885.ckpt
Train epoch: 1 Learning rate: 9e-05
100%|███████████████| 1000/1000 [10:18<00:00, 1.62it/s, loss=nan, avg_loss=nan]
Training loss: nan
100%|█| 10/10 [00:24<00:00, 2.42s/it, sdr_vocal=0, sdr_drums=0, sdr_guitar=0, sdr_ba
Instr SDR vocal: 0.0000 Debug: 60
Instr SDR drums: 0.0000 Debug: 60
Instr SDR guitar: 0.0000 Debug: 60
Instr SDR bass: 0.0000 Debug: 60
Instr SDR piano: 0.0000 Debug: 60
Instr SDR synth: 0.0000 Debug: 60
SDR Avg: 0.0000
Store weights: results3/model_htdemucs_ep_1_sdr_0.0000.ckpt
Train epoch: 2 Learning rate: 8.55e-05

rimb05 (Author) commented Aug 21, 2024

And here is my command line:

!accelerate launch Music-Source-Separation-Training/train_accelerate.py \
  --model_type htdemucs \
  --config_path config.yaml \
  --results_path results3 \
  --data_path training_output \
  --valid_path training_output_eval \
  --dataset_type 1 \
  --num_workers 4 \
  --device_ids 0 1 2 3 4 5

rimb05 (Author) commented Aug 23, 2024

Any ideas? When it runs, it is much faster, so it would be great if this could work properly.

ZFTurbo (Owner) commented Aug 24, 2024

Sorry, I have problems with this script myself, but with validation rather than training. I haven't had time to fix it yet. I will try next week.

rimb05 (Author) commented Aug 27, 2024

Thanks. I can confirm the problem happens with many different models.

ZFTurbo (Owner) commented Aug 28, 2024

I made some fixes. The main issue was probably that I had dropped the optimizer.zero_grad() call.
I have no machine to test the new code on right now. Can you please check it if possible?

UPD: I tested a bit. It looks like it works fine now.
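
For context, a missing optimizer.zero_grad() makes gradients accumulate across steps, which would be consistent with the loss diverging to NaN after the first epoch. A minimal Accelerate-style loop with the call in place; this is a generic toy sketch, not the actual train_accelerate.py:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=9e-5)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
loader = torch.utils.data.DataLoader(dataset, batch_size=4)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    loss = torch.nn.functional.l1_loss(model(x), y)
    accelerator.backward(loss)  # accelerate's replacement for loss.backward()
    optimizer.step()
    optimizer.zero_grad()       # the missing call: without it, gradients keep accumulating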

rimb05 (Author) commented Aug 29, 2024

Thanks, looks good so far!

I had a general question: when training a model with lots of stems, I notice two things:

  • The GPU is only utilized in bursts. The utilization goes up and down from 20-30% to 100%
  • While it's training, it pauses for a few seconds then resumes. It does this throughout the training.

For the pausing, increasing the data workers helps, but doesn't completely solve the problem.

And for the GPU usage, it would be great if there was a way to keep the GPU at 100% all the time. So my question is: what causes this lack of GPU efficiency? Is it an SSD speed issue or a processor issue? I'm training with six 3090 GPUs with P2P enabled, so the GPU-to-GPU speed is 50 GB/s bidirectional, and I am using a RAID 0 array that reads at 11 GB/s. Would improving the CPU or SSD speed help with this?

Thanks!

rimb05 (Author) commented Aug 31, 2024

I just upgraded my SSD but didn't see much improvement. I would love to get your insight on where the inefficiencies are occurring in these stem separation models.

jarredou commented Aug 31, 2024

Some augmentations can also cause slowdowns during training when enabled (in particular pitch shifting, time stretching, and MP3 encoding), and at least some of them, if not all, run on the CPU.

If you disable them all, is it significantly faster?

ZFTurbo (Owner) commented Sep 1, 2024

  1. During training, check the IO load (the iotop command if you're on Linux). If your data is on an SSD and your batch size isn't very big, I'm sure that's not the problem.
  2. Check that your batch size is not too big. If there isn't enough memory, you can see a big slowdown. Reduce the batch size a bit and try again.
  3. As @jarredou said, try disabling augmentations and check whether it's faster or not (a rough way to measure where the time goes is sketched below).
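
The rough measurement mentioned in point 3: time the dataloader wait and the GPU step separately to see which one dominates. A generic PyTorch sketch; loader, model, and optimizer stand in for whatever the training script actually builds:

import time
import torch

def profile_steps(loader, model, optimizer, device="cuda", num_steps=50):
    data_time = compute_time = 0.0
    steps = 0
    model.train()
    end = time.perf_counter()
    for x, y in loader:
        t_data = time.perf_counter()
        data_time += t_data - end        # time waiting on dataloader workers (disk reads, augmentations)
        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.l1_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if device == "cuda":
            torch.cuda.synchronize()     # flush queued GPU work so the timing is meaningful
        end = time.perf_counter()
        compute_time += end - t_data     # host-to-device copy + forward/backward/step
        steps += 1
        if steps >= num_steps:
            break
    print(f"avg data wait: {data_time / steps:.3f}s   avg compute: {compute_time / steps:.3f}s")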

rimb05 (Author) commented Sep 1, 2024

Thanks for the help. I tried disabling all augmentations, and it didn't make much difference. My CPU must be fast enough to keep up.

However, I did notice an interesting thing: this issue only happens when I use more than one GPU. If I train with a single GPU, the utilization is nearly 100% all the time. As soon as I add a second GPU, the utilization goes down. By the time I add 6 GPUs, it's about 50% utilization on average (it swings from 0% to 100% periodically). What could be going on?

I checked the IO load with iotop and all the worker threads are using about 1-2% IO. I also upgraded my SSD RAID 0 array and now I have 23 GB/s, so I don't think that's the bottleneck.

The number of workers is currently at 24 (4 per GPU). I tried more and it didn't make any difference.

I am testing with the MDX23c model.

ZFTurbo (Owner) commented Sep 1, 2024

Try reducing the number of workers. Make it smaller, like 2 or 4.

rimb05 (Author) commented Sep 1, 2024

When I try this, it only processes 1 or 2 steps, then it pauses for a few seconds, then does another 1 or 2, then it pauses again. To avoid the pauses I need to increase the workers to at least 16.

ZFTurbo (Owner) commented Sep 2, 2024

Does it happen with both versions of the training script (train.py and train_accelerate.py)?

rimb05 (Author) commented Sep 2, 2024

Yes, it's the same thing on both, but the accelerate version is a little faster.

rimb05 (Author) commented Sep 4, 2024

Looks like the problem was the augmentations after all, particularly the pitch-shifting and distortion ones. I didn't realize I had these turned on. Now I see a ~20% improvement when running with accelerate, and there are no more pauses. I'm also able to get my data workers down to 2 with no problem. Thanks for the help!

rimb05 (Author) commented Sep 4, 2024

I do see one problem with the accelerate version though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the lr every time (even the first time). Might be the way the SDR is being averaged across all processes?

ZFTurbo (Owner) commented Sep 4, 2024

I do see one problem with the accelerate version though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the lr every time (even the first time). Might be the way the SDR is being averaged across all processes?

I couldn't fix this issue yet. It's a problem with the scheduler. I need to understand how to call it correctly.

rimb05 (Author) commented Sep 4, 2024

It must be that the scheduler is being called multiple times per epoch (one time for every GPU?). It's the only way I can think of that the LR gets decreased even after one epoch...

ZFTurbo (Owner) commented Sep 4, 2024

It must be that the scheduler is being called multiple times per epoch (one time for every GPU?). It's the only way I can think of that the LR gets decreased even after one epoch...

Yes, but when I called it only once on the main process, the LR became different on different GPUs... I need to understand the problem.
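
One pattern that keeps the LR identical everywhere (a sketch, assuming a ReduceLROnPlateau-style scheduler stepped on validation SDR, which matches the patience/reduce-factor settings in the logs above): gather the per-process SDR, average it, and let every process step its own copy of the scheduler with that same value, so every copy makes the same decision exactly once per epoch.

import torch

def step_scheduler_synced(accelerator, scheduler, local_sdr):
    # One value per process -> gather into a tensor of shape (num_processes,)
    sdr = torch.tensor([float(local_sdr)], device=accelerator.device)
    global_sdr = accelerator.gather(sdr).mean().item()
    # Every process steps with the same number, so every scheduler copy
    # takes the same LR decision and the LRs stay in sync.
    scheduler.step(global_sdr)
    return global_sdr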

iver56 commented Sep 5, 2024

In case you are using audiomentations 0.24.0 for data augmentation and you are observing bottleneck issues: I have improved the speed in audiomentations 0.27.0, 0.31.0, 0.34.1, 0.36.0, 0.36.1 and 0.37.0 (see changelog). Upgrading may help a little bit.

jarredou commented Sep 5, 2024

If it's pedalboard's distortion that was slow, I would recommend removing that augmentation entirely, as it also creates huge gain changes, while audiomentations has a better alternative (tanh distortion) that is gain-balanced, sounds more musical, and is faster.
The most useful pedalboard augmentation is the reverb; for anything else I would go with audiomentations first.
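
For example, something along these lines; a sketch only, and TanhDistortion's exact parameter names and ranges should be checked against the current audiomentations docs:

import numpy as np
from audiomentations import TanhDistortion

# Gain-balanced tanh distortion as a possible drop-in for pedalboard's Distortion;
# the values here are illustrative, not tuned.
distort = TanhDistortion(min_distortion=0.1, max_distortion=0.6, p=0.3)

chunk = np.random.randn(2 * 44100).astype(np.float32)   # stand-in for a mono training chunk
augmented = distort(samples=chunk, sample_rate=44100)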
