System Info
- `transformers` version: 4.32.0.dev0
- Platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.31
- Python version: 3.10.10
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Tensorflow version (GPU?): 2.13.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: It seems yes, but I don't want to ;)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hi,
I'm running the unchanged "VisionTextDualEncoder and CLIP model training example" on my local laptop (which has a single GPU) and I wonder why it claims `distributed training: True` (and not `False`). From the output:
07/19/2023 15:21:22 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
The above output originates from run_clip.py:
```python
logger.warning(
    f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
    + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
```
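For reference, a minimal standalone sketch (not from run_clip.py; the `output_dir` value is just a placeholder) that constructs bare `TrainingArguments` and prints the same fields the warning string is built from:

```python
# Minimal standalone sketch (not from run_clip.py): construct bare
# TrainingArguments and print the fields the warning string above is built from.
# "tmp_check" is only a placeholder output directory.
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="tmp_check")
print("local_rank:", training_args.local_rank)
print("device:", training_args.device)
print("n_gpu:", training_args.n_gpu)
print("claims distributed training:", bool(training_args.local_rank != -1))
```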
- The default should be `training_args.local_rank = -1` according to `TrainingArguments`, but it is somehow set to `0` in this example and I don't know why (see the environment-variable check below).
- Adding `local_rank=-1` to the run_clip.py example script does not show any effect.
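Related to the first point, here is a quick sketch to check whether some launcher (torchrun, accelerate, SLURM, ...) injected the standard torch.distributed environment variables that could make `TrainingArguments` pick a non-default `local_rank`; the variable names below are the standard torch.distributed ones, not anything run_clip.py defines:

```python
# Sketch: check whether a launcher injected distributed environment variables.
# These are the standard torch.distributed variable names, not anything
# specific to run_clip.py.
import os

for var in ("LOCAL_RANK", "RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var)}")
```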
My questions:
- Is it intended that `local_rank` is set to `0`?
- Does `local_rank=0` really mean that distributed training in `Trainer` is enabled? (I'm new to `Trainer` and usually work with `DistributedDataParallel`; see the process-group check below.)
- How do I switch off distributed training?
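As a sanity check for the second question, a small sketch that asks torch directly whether a process group was actually initialized, independent of what the warning string claims:

```python
# Sketch: ask torch.distributed directly whether a process group is initialized,
# independent of the "distributed training: True" string in the warning.
import torch.distributed as dist

print("distributed available:", dist.is_available())
print("process group initialized:", dist.is_available() and dist.is_initialized())
```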
Bigger picture: sometimes my training (on a cluster) hangs at the (n-1)-th iteration and never finishes. I wonder whether this is related to distributed training, but I don't know how to debug it.
100%|█████████▉| 2875/2876 [11:34<00:00, 4.10it/s]
Thanks in advance!
Expected behavior
I don't want to use distributed training, i.e. `training_args.local_rank = -1`.