
[wip/s2s/pl] attempt to sync metrics in DDP #8269


Closed · wants to merge 6 commits

Conversation

sshleifer (Contributor)

@sshleifer sshleifer commented Nov 3, 2020

This is broken.
Attempted to add an AverageMetric into which you just dump python floats and they get averaged at the end, but it is not working in DDP.
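
The PR diff is not shown on this page, so the following is only a rough sketch of the kind of metric described (class name, state names, and reduction choices are assumptions, not the PR's actual code), built on pytorch_lightning.metrics.Metric from PL 1.0.x:

import torch
from pytorch_lightning.metrics import Metric


class AverageMetric(Metric):
    # Hypothetical sketch: dump python floats via update() and average them
    # in compute(). States registered with add_state() are what Lightning
    # tries to sync across DDP processes.
    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, value: float):
        # Python floats are folded into tensor states; the states live on
        # whatever device the metric object itself is on.
        self.total += float(value)
        self.count += 1.0

    def compute(self):
        return self.total / self.count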

Failing command (fails quickly at the val sanity check):

cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export WANDB_PROJECT=dmar
export BS=64
export m=sshleifer/mar_enro_6_3_student
export MAX_LEN=128
python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=$BS --eval_batch_size=$BS \
  --tokenizer_name Helsinki-NLP/opus-mt-en-ro --model_name_or_path $m \
  --warmup_steps 500 --sortish_sampler --logger_name wandb \
  --gpus 2 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=1 --output_dir dmar_met_test_2gpu \
  --num_train_epochs=2 --overwrite_output_dir

Traceback

  File "/home/shleifer/transformers_fork/examples/seq2seq/finetune.py", line 206, in <dictcomp>
    pl_metrics = {f"pl_{prefix}_avg_{k}": v.compute().item() for k, v in self.metric_stores.items()}
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 214, in wrapped_func
    self._sync_dist()
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 177, in _sync_dist
    output_dict = apply_to_collection(
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 53, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 53, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 49, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 100, in gather_all_tensors_if_available
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense
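
For context on the error: with the NCCL backend, torch.distributed.all_gather only accepts CUDA tensors, so the sync in Metric.compute() fails if the metric states are still on CPU. Since the traceback shows the metrics kept in a plain self.metric_stores dict, a plausible cause is that they are never registered as submodules and therefore never moved to the GPU with the LightningModule. One possible fix, sketched here under that assumption (this is not the PR's code, and it reuses the hypothetical AverageMetric sketch above):

import torch.nn as nn
import pytorch_lightning as pl


class MetricsHolderSketch(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # nn.ModuleDict registers each Metric (an nn.Module in PL 1.0.x) as a
        # submodule, so .cuda()/.to(device) also moves the metric states off
        # the CPU and the DDP all_gather then sees CUDA tensors.
        self.metric_stores = nn.ModuleDict({"loss": AverageMetric()})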

@sshleifer sshleifer changed the title from [wip/s2s] attempt to sync metrics in DDP to [wip/s2s/pl] attempt to sync metrics in DDP on Nov 3, 2020
@sshleifer sshleifer linked an issue Nov 4, 2020 that may be closed by this pull request
@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Apr 24, 2021
Development

Successfully merging this pull request may close these issues.

seq2seq/finetune.py: remove useless check