Accelerate version not working properly #63

Open
rimb05 opened this issue Aug 20, 2024 · 26 comments

Comments

rimb05 commented Aug 20, 2024

I tried training a model using:

accelerate launch train_accelerate.py

I get this output:
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 6
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.

It continues to train after the warning, but the loss value is always 'nan' and the validation results in 0.0dB for all stems.

ZFTurbo (Owner) commented Aug 20, 2024

Did you run accelerate config?

rimb05 (Author) commented Aug 20, 2024

Yes, I did and I chose all the default options.

ZFTurbo (Owner) commented Aug 20, 2024

I see you have some problem with NaNs. Try choosing float32 for training. Do you have the same problem with the standard train.py script?
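
For readers following the NaN discussion: this is roughly what the float32-vs-mixed-precision switch amounts to in a generic PyTorch training step. It is only a sketch (not the repo's actual code); the use_amp flag here mirrors the config option mentioned later in this thread.

import torch

use_amp = False  # False = plain float32 training, no autocast
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, optimizer, x, y):
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.l1_loss(model(x), y)
    scaler.scale(loss).backward()  # scaling is a no-op when use_amp is False
    scaler.step(optimizer)         # falls back to a plain optimizer.step() when disabled
    scaler.update()
    optimizer.zero_grad()
    return loss.item()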

rimb05 (Author) commented Aug 20, 2024

The same training works fine without accelerate (train.py). How would I enable float32? Do you mean in accelerate config? I am training with the mdx23c model.

rimb05 (Author) commented Aug 21, 2024

I tried it again with fp32 ("use_amp" set to false), but I still get NaNs after a while. I also tried htdemucs instead of mdx23c; it made no difference. When I run the same training without accelerate, it runs fine for hours. See my output below. You can see that after a couple of epochs I get NaNs for the loss, and then the validation returns all zeros. I would really appreciate any insight.

/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 6
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Instruments: ['vocal', 'drums', 'guitar', 'bass', 'piano', 'synth']
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.
Old metadata was used for 24603 tracks.

0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Use augmentation for training
Dataset type: 1 Processes to use: 64
Collecting metadata for ['training_output']
Found metadata cache file: results3/metadata_1.pkl
Old metadata was used for 24603 tracks.
0it [00:00, ?it/s]
Found tracks in dataset: 24603
Processes GPU: 6
Patience: 2 Reduce factor: 0.95 Batch size: 4 Grad accum steps: 1 Effective batch size: 4 Optimizer: adam
100%|█| 10/10 [00:24<00:00, 2.41s/it, sdr_vocal=-0.0673, sdr_drums=-0.0126, sdr_
Valid length: 59
Instr SDR vocal: -0.0731 Debug: 60
Instr SDR vocal: -0.0733 Debug: 60
Valid length: 59
Instr SDR drums: -0.0027 Debug: 60
Instr SDR drums: -0.0020 Debug: 60
Valid length: 59
Instr SDR guitar: -6.5638 Debug: 60
Instr SDR guitar: -6.6667 Debug: 60
Valid length: 59
Instr SDR bass: -3.7258 Debug: 60
Instr SDR bass: -3.7697 Debug: 60
Valid length: 59
Instr SDR piano: -8.7960 Debug: 60
Instr SDR piano: -8.7994 Debug: 60
Valid length: 59
Instr SDR synth: -2.4083 Debug: 60
Instr SDR synth: -2.4107 Debug: 60
SDR Avg: -3.6203
Train for: 1000
Train epoch: 0 Learning rate: 9e-05
100%|████████| 1000/1000 [10:16<00:00, 1.62it/s, loss=0.0779, avg_loss=6.87e+3]
Training loss: 68.670630
100%|█| 10/10 [00:23<00:00, 2.33s/it, sdr_vocal=2.84, sdr_drums=1.79, sdr_guitar=-1.
Instr SDR vocal: 2.5106 Debug: 60
Instr SDR drums: 1.3487 Debug: 60
Instr SDR guitar: -7.0081 Debug: 60
Instr SDR bass: -5.7739 Debug: 60
Instr SDR piano: -6.0657 Debug: 60
Instr SDR synth: -2.3428 Debug: 60
SDR Avg: -2.8885
Store weights: results3/model_htdemucs_ep_0_sdr_-2.8885.ckpt
Train epoch: 1 Learning rate: 9e-05
100%|███████████████| 1000/1000 [10:18<00:00, 1.62it/s, loss=nan, avg_loss=nan]
Training loss: nan
100%|█| 10/10 [00:24<00:00, 2.42s/it, sdr_vocal=0, sdr_drums=0, sdr_guitar=0, sdr_ba
Instr SDR vocal: 0.0000 Debug: 60
Instr SDR drums: 0.0000 Debug: 60
Instr SDR guitar: 0.0000 Debug: 60
Instr SDR bass: 0.0000 Debug: 60
Instr SDR piano: 0.0000 Debug: 60
Instr SDR synth: 0.0000 Debug: 60
SDR Avg: 0.0000
Store weights: results3/model_htdemucs_ep_1_sdr_0.0000.ckpt
Train epoch: 2 Learning rate: 8.55e-05

rimb05 (Author) commented Aug 21, 2024

And here is my command line:

!accelerate launch Music-Source-Separation-Training/train_accelerate.py \
  --model_type htdemucs \
  --config_path config.yaml \
  --results_path results3 \
  --data_path training_output \
  --valid_path training_output_eval \
  --dataset_type 1 \
  --num_workers 4 \
  --device_ids 0 1 2 3 4 5

rimb05 (Author) commented Aug 23, 2024

Any ideas? When it runs, it is much faster, so it would be great if this could work properly.

ZFTurbo (Owner) commented Aug 24, 2024

Sorry, I have problems with this script myself, but with validation rather than training. I haven't had time to fix it yet. I will try next week.

rimb05 (Author) commented Aug 27, 2024

Thanks. I can confirm the problem happens with many different models.

ZFTurbo (Owner) commented Aug 28, 2024

I made some fixes. The main issue was probably that I had dropped the optimizer.zero_grad() call.
I have no machine to test the new code on right now. Can you please check it if possible?

UPD: I tested a bit. It looks like it works fine now.
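
For context, a missing optimizer.zero_grad() makes gradients accumulate across steps, which would be consistent with the loss diverging to NaN after the first epoch. A minimal Accelerate-style loop with the call in place; this is a generic toy sketch, not the actual train_accelerate.py:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=9e-5)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
loader = torch.utils.data.DataLoader(dataset, batch_size=4)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    loss = torch.nn.functional.l1_loss(model(x), y)
    accelerator.backward(loss)  # accelerate's replacement for loss.backward()
    optimizer.step()
    optimizer.zero_grad()       # the missing call: without it, gradients keep accumulating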

rimb05 (Author) commented Aug 29, 2024

Thanks, looks good so far!

I had a general question: when training a model with lots of stems, I notice two things:

  • The GPU is only utilized in bursts. The utilization goes up and down from 20-30% to 100%
  • While it's training, it pauses for a few seconds then resumes. It does this throughout the training.

For the pausing, increasing the data workers helps, but doesn't completely solve the problem.

And for the GPU usage, it would be great if there was a way to keep the GPU at 100% all the time. So my question is: what causes this lack of GPU efficiency? Is it an SSD speed issue or a processor issue? I'm training with six 3090 GPUs with P2P enabled, so the GPU-to-GPU speed is 50 GB/s bidirectional, and I am using a RAID 0 array that reads at 11 GB/s. Would improving the CPU or SSD speed help with this?

Thanks!

rimb05 (Author) commented Aug 31, 2024

I just upgraded my SSD but didn't see much improvement. I would love to get your insight on where the inefficiencies are occurring in these stem separation models.

jarredou commented Aug 31, 2024

Some augmentations can also cause slowdowns during training when enabled (in particular pitch shifting, time stretching, and MP3 encoding), and at least some of them, if not all, run on the CPU.

If you disable them all, is it significantly faster?

ZFTurbo (Owner) commented Sep 1, 2024

  1. During training, check the IO load (the iotop command if you're on Linux). If your data is on an SSD and your batch size isn't very big, I'm sure that's not the problem.
  2. Check that your batch size is not too big. If there isn't enough memory, you can see a big slowdown. Reduce the batch size a bit and try again.
  3. As @jarredou said, try disabling augmentations and check whether it's faster or not (a rough way to measure where the time goes is sketched below).
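
The rough measurement mentioned in point 3: time the dataloader wait and the GPU step separately to see which one dominates. A generic PyTorch sketch; loader, model, and optimizer stand in for whatever the training script actually builds:

import time
import torch

def profile_steps(loader, model, optimizer, device="cuda", num_steps=50):
    data_time = compute_time = 0.0
    steps = 0
    model.train()
    end = time.perf_counter()
    for x, y in loader:
        t_data = time.perf_counter()
        data_time += t_data - end        # time waiting on dataloader workers (disk reads, augmentations)
        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.l1_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if device == "cuda":
            torch.cuda.synchronize()     # flush queued GPU work so the timing is meaningful
        end = time.perf_counter()
        compute_time += end - t_data     # host-to-device copy + forward/backward/step
        steps += 1
        if steps >= num_steps:
            break
    print(f"avg data wait: {data_time / steps:.3f}s   avg compute: {compute_time / steps:.3f}s")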

rimb05 (Author) commented Sep 1, 2024

Thanks for the help. I tried disabling all augmentations, and it didn't make much difference. My CPU must be fast enough to keep up.

However, I did notice an interesting thing: this issue only happens when I use more than one GPU. If I train with a single GPU, the utilization is nearly 100% all the time. As soon as I add a second GPU, the utilization goes down. By the time I add 6 GPUs, it's about 50% utilization on average (it swings from 0% to 100% periodically). What could be going on?

I checked the IO load with iotop and all the worker threads are using about 1-2% IO. I also upgraded my SSD RAID 0 array and now I have 23 GB/s, so I don't think that's the bottleneck.

The number of workers is currently at 24 (4 per GPU). I tried more and it didn't make any difference.

I am testing with the MDX23c model.

ZFTurbo (Owner) commented Sep 1, 2024

Try reducing the number of workers. Make it smaller, like 2 or 4.

rimb05 (Author) commented Sep 1, 2024

When I try this, it only processes 1 or 2 steps, then it pauses for a few seconds, then does another 1 or 2, then it pauses again. To avoid the pauses I need to increase the workers to at least 16.

ZFTurbo (Owner) commented Sep 2, 2024

Does it happen with both versions of the training script (train.py and train_accelerate.py)?

rimb05 (Author) commented Sep 2, 2024

Yes, it's the same thing on both, but the accelerate version is a little faster.

rimb05 (Author) commented Sep 4, 2024

Looks like the problem was the augmentations after all, particularly the pitch-shifting and distortion ones. I didn't realize I had these turned on. Now I see a ~20% improvement when running with accelerate, and there are no more pauses. I'm also able to get my data workers down to 2 with no problem. Thanks for the help!

rimb05 (Author) commented Sep 4, 2024

I do see one problem with the accelerate version though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the lr every time (even the first time). Might be the way the SDR is being averaged across all processes?

ZFTurbo (Owner) commented Sep 4, 2024

I do see one problem with the accelerate version though. For some reason, the learning rate decreases after every epoch. Patience is set to 3, but it still decreases the lr every time (even the first time). Might be the way the SDR is being averaged across all processes?

I couldn't fix this issue yet. It's a problem with the scheduler. I need to understand how to call it correctly.

rimb05 (Author) commented Sep 4, 2024

It must be that the scheduler is being called multiple times per epoch (one time for every GPU?). It's the only way I can think of that the LR gets decreased even after one epoch...

ZFTurbo (Owner) commented Sep 4, 2024

It must be that the scheduler is being called multiple times per epoch (one time for every GPU?). It's the only way I can think of that the LR gets decreased even after one epoch...

Yes, but when I called it only once on the main process, the LR became different on different GPUs... I need to understand the problem.
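
One pattern that keeps the LR identical everywhere (a sketch, assuming a ReduceLROnPlateau-style scheduler stepped on validation SDR, which matches the patience/reduce-factor settings in the logs above): gather the per-process SDR, average it, and let every process step its own copy of the scheduler with that same value, so every copy makes the same decision exactly once per epoch.

import torch

def step_scheduler_synced(accelerator, scheduler, local_sdr):
    # One value per process -> gather into a tensor of shape (num_processes,)
    sdr = torch.tensor([float(local_sdr)], device=accelerator.device)
    global_sdr = accelerator.gather(sdr).mean().item()
    # Every process steps with the same number, so every scheduler copy
    # takes the same LR decision and the LRs stay in sync.
    scheduler.step(global_sdr)
    return global_sdr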

iver56 commented Sep 5, 2024

In case you are using audiomentations 0.24.0 for data augmentation and you are observing bottleneck issues: I have improved the speed in audiomentations 0.27.0, 0.31.0, 0.34.1, 0.36.0, 0.36.1 and 0.37.0 (see changelog). Upgrading may help a little bit.

jarredou commented Sep 5, 2024

If it's pedalboard's distortion that was slow, I would recommend removing that augmentation entirely, as it also creates huge gain changes, while audiomentations has a better alternative (tanh distortion) that is gain-balanced, sounds more musical, and is faster.
The most useful pedalboard augmentation is the reverb; for anything else I would go with audiomentations first.
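
For example, something along these lines; a sketch only, and TanhDistortion's exact parameter names and ranges should be checked against the current audiomentations docs:

import numpy as np
from audiomentations import TanhDistortion

# Gain-balanced tanh distortion as a possible drop-in for pedalboard's Distortion;
# the values here are illustrative, not tuned.
distort = TanhDistortion(min_distortion=0.1, max_distortion=0.6, p=0.3)

chunk = np.random.randn(2 * 44100).astype(np.float32)   # stand-in for a mono training chunk
augmented = distort(samples=chunk, sample_rate=44100)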
