fix parameter placement #3

YazhiGao · 2021-09-21T01:39:25Z

Summary: fix tensor placement where the remote device should receive {rank, local_rank}

Differential Revision: D31072120
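One plausible reading of the summary is that a tensor's target device has to be described by both the global rank and the rank within the host. The sketch below is an illustration only, not the code changed in this PR: `Placement`, `placement_for`, and `local_world_size` are hypothetical names, and it assumes ranks are assigned contiguously per host with one GPU per process.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    rank: int        # global rank across all hosts
    local_rank: int  # index of the process (and its GPU) within the host
    device: str      # device string the tensor would be placed on


def placement_for(rank: int, local_world_size: int) -> Placement:
    """Derive the {rank, local_rank} pair and a device string for a global rank.

    Assumption: ranks are laid out host by host, so host h owns ranks
    [h * local_world_size, (h + 1) * local_world_size).
    """
    if local_world_size <= 0:
        raise ValueError("local_world_size must be positive")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    local_rank = rank % local_world_size
    return Placement(rank, local_rank, f"cuda:{local_rank}")
```

Under these assumptions, with 8 processes per host, global rank 10 maps to local rank 2. Using the global rank as the device index would instead point at a GPU that does not exist on that host.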


facebook-github-bot · 2021-09-21T01:39:41Z

This pull request was exported from Phabricator. Differential Revision: D31072120

Summary: Some bug fixes during the integration test in PyPER O3:

### fix pytorch#1

`_embedding_bag_collection` (`ShardedEmbeddingBagCollection`) is not actually called by input_dist, because the same input is already distributed by `ShardedManagedCollisionCollection`. So it never gets a chance to initialize `_input_dist`. As a result, TREC pipelining thinks it is not ready for input distribution. This is not expected, since the module is not used in the stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv), it satisfies the assertion but does not carry `_input_dists`, so it is not put into the fused a2a.

### fix pytorch#2

`ManagedCollisionCollection.forward` is not traceable because it uses the unwrapped `KeyedJaggedTensor.from_jt_dict`. We don't care about its internal details, so just keep it atomic.

### fix pytorch#3

Because of how the remap table is set up, `MCHManagedCollisionModule` doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless. A more memory-efficient fix would probably be to change the remapper to i32 if necessary.

Differential Revision: D48804332


Summary: Pull Request resolved: #1541

Some bug fixes during the integration test in PyPER O3:

# fix #1

_embedding_bag_collection (ShardedEmbeddingBagCollection) is not actually called by input_dist, because the same input is already distributed by ShardedManagedCollisionCollection. So it never gets a chance to initialize _input_dist. As a result, TREC pipelining thinks it is not ready for input distribution. This is not expected, since the module is not used in the stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv), it satisfies the assertion but does not carry _input_dists, so it is not put into the fused a2a.

# fix #2

ManagedCollisionCollection.forward is not traceable because it uses the unwrapped KeyedJaggedTensor.from_jt_dict. We don't care about its internal details, so just keep it atomic.

# fix #3

Because of how the remap table is set up, MCHManagedCollisionModule doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless. A more memory-efficient fix would probably be to change the remapper to i32 if necessary.

Reviewed By: dstaay-fb

Differential Revision: D51601041

fbshipit-source-id: 95cf346b5247f1d5afb6643ecfd7dca4b3c4d575
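The third fix in the commit message above (convert i32 id lists to i64) can be illustrated with a small stand-in. This is a sketch using Python's standard `array` module rather than torch tensors, and `widen_ids_to_int64` is a hypothetical helper, not part of `MCHManagedCollisionModule`. It only shows that widening keeps the values while changing the storage width, which is why the message notes that an i32 remapper might be more memory efficient.

```python
from array import array


def widen_ids_to_int64(ids: array) -> array:
    """Return a copy of `ids` stored as signed 64-bit integers (typecode 'q')."""
    return array("q", list(ids))


ids_i32 = array("i", [3, 17, 42])  # 'i' is a C int, 4 bytes on common platforms
ids_i64 = widen_ids_to_int64(ids_i32)

# The values are unchanged; only the per-element storage width differs.
print(list(ids_i64), ids_i32.itemsize, ids_i64.itemsize)
```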

fix parameter placement

1f04b46

Summary: fix tensor placement where the remote device should receive {rank, local_rank}

Differential Revision: D31072120

fbshipit-source-id: 13de19a31a5cafeef280ed7b38a8372a4038fe89

facebook-github-bot added the CLA Signed and fb-exported labels Sep 21, 2021

facebook-github-bot closed this in 3015bca Sep 21, 2021

levythu mentioned this pull request Sep 13, 2023

Multitple fixes to MC modules to facilitate integration #1391

Open

duduyi2013 mentioned this pull request Nov 27, 2023

MCM Fix for ig integration #1541

Closed

xiexbing mentioned this pull request Oct 22, 2024

CUDA kernel error when using VBE #2502

Closed
