Count shard state in HBM usage #3114


Closed · wants to merge 1 commit

Conversation

@levythu (Contributor) commented Sep 11, 2024

Summary:
This PR improves the sparse HBM cost estimate by accounting for the size of the auxiliary state used to maintain the UVM cache. As noted in the comments of split_table_batched_embeddings_ops_training, the significant space is currently `4 * table_height + 8 * cache_height + 8 * cache_height`. This becomes nontrivial for tables with many rows but few dimensions.

Impact:

  • Non-UVM-offloaded job: no-op.
  • UVM-offloaded job: more balanced memory usage thanks to the more precise estimate. However, for existing UVM jobs that combine the scale-up proposer with a fixed-percentage reservation, the proposer may scale the cache up less aggressively, which can hurt performance; in that case, tune toward a more slack reservation percentage.

Differential Revision: D61576911
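The accounting described above can be sketched as a rough sizing helper. The function names and the byte-width interpretations below are illustrative assumptions, not the actual FBGEMM/torchrec API:

```python
# Rough sketch of the auxiliary-state accounting described in the summary.
# Names and byte-width interpretations are illustrative assumptions, not
# the actual FBGEMM/torchrec API.

def uvm_aux_state_bytes(table_height: int, cache_height: int) -> int:
    """Auxiliary HBM needed to maintain the UVM cache for one table:
    4 bytes per table row plus two 8-byte entries per cache slot."""
    return 4 * table_height + 8 * cache_height + 8 * cache_height

def sparse_hbm_bytes(table_height: int, dim: int, cache_height: int,
                     elem_bytes: int = 4) -> int:
    """Cached embedding rows resident in HBM plus the auxiliary state."""
    return cache_height * dim * elem_bytes + uvm_aux_state_bytes(
        table_height, cache_height)

# A tall, narrow table: 100M rows, dim 4, 1M cached rows (fp32).
cache_only = 1_000_000 * 4 * 4                       # 16 MB of cached rows
total = sparse_hbm_bytes(100_000_000, 4, 1_000_000)  # 432 MB in total
aux = total - cache_only                             # 416 MB of aux state
# The auxiliary state dwarfs the cache itself here, which is why ignoring
# it skews the HBM estimate for many-row, few-dimension tables.
```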


netlify bot commented Sep 11, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: bc84822
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66e269236b0e0f000834ad6c
😎 Deploy Preview: https://deploy-preview-3114--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D61576911

levythu added a commit to levythu/torchrec that referenced this pull request Sep 11, 2024
levythu added a commit to levythu/FBGEMM that referenced this pull request Sep 11, 2024
levythu added a commit to levythu/FBGEMM that referenced this pull request Sep 12, 2024
levythu added a commit to levythu/torchrec that referenced this pull request Sep 12, 2024

facebook-github-bot pushed a commit to pytorch/torchrec that referenced this pull request Sep 12, 2024
Summary:
Pull Request resolved: #2380

X-link: facebookresearch/FBGEMM#203

X-link: pytorch/FBGEMM#3114

This PR improves the sparse HBM cost estimate by accounting for the size of the auxiliary state used to maintain the UVM cache. As noted in the comments of split_table_batched_embeddings_ops_training, the significant space is currently `4 * hash_size + 8 * cache_slot_size + 8 * cache_slot_size`. This becomes nontrivial for tables with many rows but few dimensions.

Impact:
- Non-UVM-offloaded job: no-op.
- UVM-offloaded job: more balanced memory usage thanks to the more precise estimate. However, for existing UVM jobs that combine the scale-up proposer with a fixed-percentage reservation, the proposer may scale the cache up less aggressively, which can hurt performance; in that case, tune toward a more slack reservation percentage.

Reviewed By: sarckk

Differential Revision: D61576911

fbshipit-source-id: 6b501dc63cbe86c5274661b1d985af6a7a0a87c6
@facebook-github-bot (Contributor)

This pull request has been merged in 6a349df.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025