
Multiple fixes to MC modules to facilitate integration #1391


Open
wants to merge 1 commit into main

Conversation

@levythu levythu commented Sep 13, 2023

Summary:
Some bug fixes made during integration testing in PyPER O3:

fix #1

`_embedding_bag_collection` (`ShardedEmbeddingBagCollection`) is not actually called by input_dist (because the same input is already distributed by `ShardedManagedCollisionCollection`). So it never gets a chance to initialize `_input_dist`, and as a result, TREC pipelining thinks it's not ready for input distribution.

This is not expected, since the module is not used in that stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv) it satisfies the assertion, while not carrying `_input_dists` means it won't be put into the fused a2a.
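The selection logic described above can be sketched in miniature. This is a hypothetical illustration, not the torchrec implementation; the class and function names are made up for the example. A module whose `_input_dists` is empty trivially passes the readiness check and is excluded from the fused all-to-all:

```python
# Hypothetical sketch of the behavior described above (names are illustrative).
class ShardedModuleStub:
    def __init__(self, input_dists):
        # An empty list means this module has nothing of its own to distribute,
        # e.g. because another module already distributed the same input.
        self._input_dists = input_dists

    @property
    def has_input_dist(self):
        return len(self._input_dists) > 0


def modules_for_fused_a2a(modules):
    # Only modules that actually carry input dists join the fused all-to-all;
    # the rest satisfy the pipeline's readiness assertion trivially.
    return [m for m in modules if m.has_input_dist]
```

Under this sketch, the embedding bag collection behaves like a stub with an empty `_input_dists`: it passes the assertion but never enters the fused communication.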

fix #2

`ManagedCollisionCollection.forward` is not traceable because it uses the unwrapped `KeyedJaggedTensor.from_jt_dict`. We don't care about its internal details, so just keep it atomic.
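The idea of keeping a helper "atomic" can be sketched as follows. This is a hypothetical, dependency-free illustration: in `torch.fx` this role is played by `torch.fx.wrap`, which tells the tracer to emit a single call node instead of tracing into the function body; here a simple registry stands in for that mechanism:

```python
# Hypothetical sketch (not the torch.fx implementation): record which
# functions a tracer should treat as a single opaque call.
ATOMIC_FNS = set()


def keep_atomic(fn):
    # In torch.fx this is torch.fx.wrap(); here we just register the name
    # so a tracer would emit one call_function node for it.
    ATOMIC_FNS.add(fn.__name__)
    return fn


@keep_atomic
def from_jt_dict(jt_dict):
    # Internal details no longer matter to the tracer once the call is atomic;
    # this body is a stand-in for the real KeyedJaggedTensor construction.
    return {"keys": sorted(jt_dict)}
```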

fix #3

Due to how the remap table is set up, `MCHManagedCollisionModule` doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless. A more memory-efficient fix would probably be to change the remapper to i32 when necessary.
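The easy fix amounts to unconditionally widening incoming ids before the remap lookup. A minimal sketch using only the standard library (in torch this would be `ids.to(torch.int64)`; the function names here are illustrative):

```python
# Hypothetical sketch of the i64 workaround described above.
import array


def widen_to_i64(ids):
    # Typecode "q" is a signed 64-bit int; this accepts i32 input and
    # widens it, so the remap table only ever sees one id width.
    return array.array("q", ids)


def remap(ids, remap_table):
    # Widen unconditionally, then look up each id; -1 marks a missing entry.
    return [remap_table.get(i, -1) for i in widen_to_i64(ids)]
```

This trades memory (eight bytes per id regardless of input width) for simplicity, which matches the trade-off noted above.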

Differential Revision: D48804332

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 13, 2023
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D48804332

levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 18, 2023

levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 19, 2023


Labels
CLA Signed, fb-exported

2 participants