@vfdev-5 vfdev-5 commented Oct 22, 2021

Description:
- Now uses a new gloo group to compute nproc per node
  - Context: when using NCCL, if the user sets up the CUDA device per process incorrectly, idist hangs in _compute_nproc_per_node
  - Here is an example: https://app.circleci.com/pipelines/github/pytorch/ignite/2264/workflows/2e3073fd-0859-41c7-91e8-eef0f8eabee2/jobs/7060?invite=true#step-107-872
  - However, I couldn't reproduce the issue on my setup
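
The idea in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not necessarily the code merged in this PR: the function name `compute_nproc_per_node` and its `local_rank` argument are hypothetical, the default process group is assumed to be initialized already, and every node is assumed to run the same number of processes.

```python
import torch
import torch.distributed as dist


def compute_nproc_per_node(local_rank: int) -> int:
    """Return the number of processes per node using a CPU-only collective.

    Hypothetical helper: assumes dist.init_process_group() was already called
    (for example with the NCCL backend) and that local ranks on each node
    run from 0 to nproc_per_node - 1.
    """
    # A dedicated gloo group works on CPU tensors, so this collective does not
    # depend on each process having selected the correct CUDA device.
    gloo_group = dist.new_group(backend="gloo")
    # The largest (local_rank + 1) across all processes is the per-node count.
    tensor = torch.tensor([local_rank + 1], device="cpu")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX, group=gloo_group)
    return int(tensor.item())
```

Each process would call this once with its own local rank. On nodes running 4 processes each, the local ranks are 0 to 3, so the MAX reduction of `local_rank + 1` gives 4 on every process.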

cc @sdesrozis
@github-actions github-actions bot added the "ci" and "module: distributed" labels Oct 22, 2021
@vfdev-5 vfdev-5 requested a review from sdesrozis October 23, 2021 11:35
@sdesrozis sdesrozis left a comment

@vfdev-5 That looks great!

@sdesrozis sdesrozis merged commit 4a37e35 into master Oct 23, 2021
@vfdev-5 vfdev-5 deleted the fix-compute_nproc_per_node branch January 17, 2022 15:26