
[wip/s2s/pl] attempt to sync metrics in DDP #8269


Closed · wants to merge 6 commits

Conversation

sshleifer (Contributor)

@sshleifer sshleifer commented Nov 3, 2020

This is broken.
Attempted to add an AverageMetric into which you just dump python floats and they get averaged at the end, but it is not working in DDP.
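
The PR diff is not shown on this page, so the following is only a rough sketch of the kind of metric described (class name, state names, and reduction choices are assumptions, not the PR's actual code), built on pytorch_lightning.metrics.Metric from PL 1.0.x:

import torch
from pytorch_lightning.metrics import Metric


class AverageMetric(Metric):
    # Hypothetical sketch: dump python floats via update() and average them
    # in compute(). States registered with add_state() are what Lightning
    # tries to sync across DDP processes.
    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, value: float):
        # Python floats are folded into tensor states; the states live on
        # whatever device the metric object itself is on.
        self.total += float(value)
        self.count += 1.0

    def compute(self):
        return self.total / self.count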

Failing command (fails quickly at the val sanity check):

cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export WANDB_PROJECT=dmar
export BS=64
export m=sshleifer/mar_enro_6_3_student
export MAX_LEN=128
python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=$BS --eval_batch_size=$BS \
  --tokenizer_name Helsinki-NLP/opus-mt-en-ro --model_name_or_path $m \
  --warmup_steps 500 --sortish_sampler --logger_name wandb \
  --gpus 2 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=1 --output_dir dmar_met_test_2gpu \
  --num_train_epochs=2 --overwrite_output_dir

Traceback

  File "/home/shleifer/transformers_fork/examples/seq2seq/finetune.py", line 206, in <dictcomp>
    pl_metrics = {f"pl_{prefix}_avg_{k}": v.compute().item() for k, v in self.metric_stores.items()}
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 214, in wrapped_func
    self._sync_dist()
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 177, in _sync_dist
    output_dict = apply_to_collection(
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 53, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 53, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 49, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 100, in gather_all_tensors_if_available
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/shleifer/miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense
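
For context on the error: with the NCCL backend, torch.distributed.all_gather only accepts CUDA tensors, so the sync in Metric.compute() fails if the metric states are still on CPU. Since the traceback shows the metrics kept in a plain self.metric_stores dict, a plausible cause is that they are never registered as submodules and therefore never moved to the GPU with the LightningModule. One possible fix, sketched here under that assumption (this is not the PR's code, and it reuses the hypothetical AverageMetric sketch above):

import torch.nn as nn
import pytorch_lightning as pl


class MetricsHolderSketch(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # nn.ModuleDict registers each Metric (an nn.Module in PL 1.0.x) as a
        # submodule, so .cuda()/.to(device) also moves the metric states off
        # the CPU and the DDP all_gather then sees CUDA tensors.
        self.metric_stores = nn.ModuleDict({"loss": AverageMetric()})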

@sshleifer sshleifer changed the title from [wip/s2s] attempt to sync metrics in DDP to [wip/s2s/pl] attempt to sync metrics in DDP on Nov 3, 2020
@sshleifer sshleifer linked an issue Nov 4, 2020 that may be closed by this pull request
@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Apr 24, 2021
Development

Successfully merging this pull request may close these issues.

seq2seq/finetune.py: remove useless check