
VisionTextDualEncoder: Distributed training is always enabled #24924

@phiyodr

Description

System Info

  • transformers version: 4.32.0.dev0
  • Platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.31
  • Python version: 3.10.10
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): 2.13.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: It seems yes, but I don't want to ;)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi,

I'm running the unchanged "VisionTextDualEncoder and CLIP model training example" on my local laptop (which has a single GPU) and wonder why it reports distributed training: True rather than False. From the output:

07/19/2023 15:21:22 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False

The above output originates from this logging call in run_clip.py:

    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
  • According to the TrainingArguments documentation the default should be training_args.local_rank = -1, but in this example it is somehow set to 0 and I don't know why (see the diagnostic sketch below).
  • Explicitly adding local_rank=-1 to the run_clip.py example script has no effect.
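
To narrow this down, here is a minimal diagnostic sketch of what I am checking, under the assumption that launcher environment variables are what flip local_rank to 0. The variable names are the standard torch.distributed ones, nothing specific to run_clip.py:

    import os

    import torch.distributed as dist

    # Print the launcher environment variables that typically make a
    # distributed setup get detected, even on a single machine.
    for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
        print(f"{var}={os.environ.get(var, '<unset>')}")

    # Check whether a process group has actually been initialized.
    print("torch.distributed available:  ", dist.is_available())
    print("torch.distributed initialized:", dist.is_available() and dist.is_initialized())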

My questions:

  • Is it intended that local_rank is set to 0?
  • Does local_rank=0 really mean that distributed training in Trainer is enabled? (I'm new to Trainer and usually work with DistributedDataParallel)
  • How do I switch off distributed training?

Bigger picture: sometimes my training (on a cluster) hangs at iteration n-1 and never finishes. I wonder whether this is related to distributed training, but I don't know how to debug it (one idea is sketched below the progress bar output).

100%|█████████▉| 2875/2876 [11:34<00:00,  4.10it/s]
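
The only debugging idea I have so far is a sketch using just the standard library: periodically dump every thread's stack so I can see where each process is stuck when it hangs (the 300-second interval is an arbitrary choice):

    import faulthandler
    import sys

    # Dump a traceback of every thread to stderr every 300 seconds until the
    # process exits, so a hang near the end of an epoch shows where it is waiting.
    faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)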

Thanks in advance!

Expected behavior

I don't want to use distributed training, i.e. I expect training_args.local_rank to be -1.
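
Concretely, on a plain single-GPU run (launched with python, not torchrun or accelerate launch) I would expect a check like the following sketch to pass. It only uses TrainingArguments attributes from the documentation (local_rank, world_size, parallel_mode), and output_dir="tmp_debug" is just a placeholder:

    from transformers import TrainingArguments
    from transformers.training_args import ParallelMode

    # Placeholder arguments, constructed the same way run_clip.py would on a
    # single-GPU machine without any distributed launcher.
    training_args = TrainingArguments(output_dir="tmp_debug")

    assert training_args.local_rank == -1, "distributed training is unexpectedly enabled"
    assert training_args.world_size == 1
    assert training_args.parallel_mode == ParallelMode.NOT_DISTRIBUTED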
