Count shard state in HBM usage #3114


Closed · wants to merge 1 commit

Conversation

@levythu (Contributor) commented Sep 11, 2024

Summary:
This PR improves the sparse HBM cost estimate by accounting for the size of the auxiliary state used to maintain the UVM cache. As noted in the comments of split_table_batched_embeddings_ops_training, the significant space is currently `4 * table_height + 8 * cache_height + 8 * cache_height`. This becomes nontrivial for tables with many rows but few dimensions.

Impact:

  • Non-UVM-offloaded job: no-op.
  • UVM-offloaded job: more balanced memory usage thanks to the more precise estimate. However, for existing UVM jobs that combine the scale-up proposer with a fixed-percentage reservation, the proposer may scale the cache up less aggressively, which can hurt performance; in that case, tune toward a more slack reservation percentage.

Differential Revision: D61576911
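The accounting described above can be sketched as a rough sizing helper. The function names and the byte-width interpretations below are illustrative assumptions, not the actual FBGEMM/torchrec API:

```python
# Rough sketch of the auxiliary-state accounting described in the summary.
# Names and byte-width interpretations are illustrative assumptions, not
# the actual FBGEMM/torchrec API.

def uvm_aux_state_bytes(table_height: int, cache_height: int) -> int:
    """Auxiliary HBM needed to maintain the UVM cache for one table:
    4 bytes per table row plus two 8-byte entries per cache slot."""
    return 4 * table_height + 8 * cache_height + 8 * cache_height

def sparse_hbm_bytes(table_height: int, dim: int, cache_height: int,
                     elem_bytes: int = 4) -> int:
    """Cached embedding rows resident in HBM plus the auxiliary state."""
    return cache_height * dim * elem_bytes + uvm_aux_state_bytes(
        table_height, cache_height)

# A tall, narrow table: 100M rows, dim 4, 1M cached rows (fp32).
cache_only = 1_000_000 * 4 * 4                       # 16 MB of cached rows
total = sparse_hbm_bytes(100_000_000, 4, 1_000_000)  # 432 MB in total
aux = total - cache_only                             # 416 MB of aux state
# The auxiliary state dwarfs the cache itself here, which is why ignoring
# it skews the HBM estimate for many-row, few-dimension tables.
```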


netlify bot commented Sep 11, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: bc84822
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66e269236b0e0f000834ad6c
😎 Deploy Preview: https://deploy-preview-3114--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D61576911

levythu added a commit to levythu/torchrec that referenced this pull request Sep 11, 2024
levythu added a commit to levythu/FBGEMM that referenced this pull request Sep 11, 2024
levythu added a commit to levythu/FBGEMM that referenced this pull request Sep 12, 2024
levythu added a commit to levythu/torchrec that referenced this pull request Sep 12, 2024

facebook-github-bot pushed a commit to pytorch/torchrec that referenced this pull request Sep 12, 2024
Summary:
Pull Request resolved: #2380

X-link: facebookresearch/FBGEMM#203

X-link: pytorch/FBGEMM#3114

This PR improves the sparse HBM cost estimate by accounting for the size of the auxiliary state used to maintain the UVM cache. As noted in the comments of split_table_batched_embeddings_ops_training, the significant space is currently `4 * hash_size + 8 * cache_slot_size + 8 * cache_slot_size`. This becomes nontrivial for tables with many rows but few dimensions.

Impact:
- Non-UVM-offloaded job: no-op.
- UVM-offloaded job: more balanced memory usage thanks to the more precise estimate. However, for existing UVM jobs that combine the scale-up proposer with a fixed-percentage reservation, the proposer may scale the cache up less aggressively, which can hurt performance; in that case, tune toward a more slack reservation percentage.

Reviewed By: sarckk

Differential Revision: D61576911

fbshipit-source-id: 6b501dc63cbe86c5274661b1d985af6a7a0a87c6
@facebook-github-bot (Contributor)

This pull request has been merged in 6a349df.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025