Count shard state in HBM usage #3114
Conversation
This pull request was exported from Phabricator. Differential Revision: D61576911
Summary:
Pull Request resolved: #2380
X-link: facebookresearch/FBGEMM#203
X-link: pytorch/FBGEMM#3114

This PR improves the sparse HBM cost estimate by accounting for the size of the auxiliary state used to maintain the UVM cache. As noted in the comments of split_table_batched_embeddings_ops_training, the significant space is currently `4 * hash_size + 8 * cache_slot_size + 8 * cache_slot_size`. This becomes nontrivial for tables with many rows but few dimensions.

Impact:
- Non-UVM-offloaded jobs: no-op.
- UVM-offloaded jobs: more balanced memory usage from the more precise estimate. However, for existing UVM jobs that use the scale-up proposer with a fixed reservation percentage, the proposer may scale the cache up less aggressively and therefore hurt performance; in that case, tune to a more slack reservation percentage.

Reviewed By: sarckk
Differential Revision: D61576911
fbshipit-source-id: 6b501dc63cbe86c5274661b1d985af6a7a0a87c6
This pull request has been merged in 6a349df.
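For concreteness, here is a minimal sketch (not the planner's or FBGEMM's actual code) of the cost formula quoted in the summary. The helper name `uvm_cache_aux_state_bytes`, the example table sizes, and the 20% cache load factor are illustrative assumptions; only the `4 * hash_size + 8 * cache_slot_size + 8 * cache_slot_size` formula comes from the PR.

```python
# A hypothetical sketch of the auxiliary-state cost described in the summary.
# `hash_size` and `cache_slot_size` follow the names in the quoted formula.

def uvm_cache_aux_state_bytes(hash_size: int, cache_slot_size: int) -> int:
    """HBM bytes consumed by UVM-cache bookkeeping, per the summary's formula:
    4 bytes per table row plus 8 bytes per cache slot, counted twice."""
    return 4 * hash_size + 8 * cache_slot_size + 8 * cache_slot_size


# Example: a tall, narrow table where this state dominates the cached rows.
hash_size = 100_000_000           # rows in the embedding table (assumption)
dim = 4                           # embedding dimension, few columns (assumption)
cache_slot_size = hash_size // 5  # assume the cache holds 20% of the rows

aux_bytes = uvm_cache_aux_state_bytes(hash_size, cache_slot_size)
cached_row_bytes = cache_slot_size * dim * 4  # fp32 rows resident in HBM

print(f"aux state:   {aux_bytes / 2**30:.2f} GiB")        # ~0.67 GiB
print(f"cached rows: {cached_row_bytes / 2**30:.2f} GiB")  # ~0.30 GiB
```

Under these assumed numbers the bookkeeping state (~0.67 GiB) exceeds the cached embedding rows themselves (~0.30 GiB), which illustrates why ignoring it skews the HBM estimate for many-row, few-dimension tables.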