
Commit dbad57e

Authored by levythu, committed by facebook-github-bot
Count shard state in HBM usage (pytorch#203)
Summary:
X-link: pytorch/torchrec#2380
Pull Request resolved: facebookresearch/FBGEMM#203
X-link: pytorch#3114

This PR improves the sparse HBM cost estimate by accounting for the size of the auxiliary state needed to maintain the UVM cache. As noted in the comments in split_table_batched_embeddings_ops_training, the significant space is currently `4 * hash_size + 8 * cache_slot_size + 8 * cache_slot_size`. This becomes nontrivial for tables with many rows but few dimensions.

Impact:
- Non-UVM-offloaded jobs: no-op.
- UVM-offloaded jobs: more balanced memory usage from the more precise estimate. However, for existing UVM jobs that use the scale-up proposer with a fixed reservation percentage, the proposer may make less aggressive cache scale-ups and therefore perform worse; in that case, tune toward a larger slack reservation percentage.

Reviewed By: sarckk

Differential Revision: D61576911

fbshipit-source-id: 6b501dc63cbe86c5274661b1d985af6a7a0a87c6
1 parent: f85dfb8 · commit: dbad57e
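The formula quoted in the summary can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not FBGEMM's actual estimator; `uvm_cache_aux_state_bytes`, `hash_size`, `cache_load_factor` (the `clf` in the diff comments below), `num_tables`, and `policy` are hypothetical names introduced here.

```python
# Minimal sketch (an illustration, not FBGEMM's actual estimator) of the
# auxiliary UVM-cache state this commit starts counting toward HBM usage.
# All parameter names are illustrative, not FBGEMM identifiers.

def uvm_cache_aux_state_bytes(
    hash_size: int,            # total embedding hash size under the UVM cache
    cache_load_factor: float,  # fraction of rows resident in the HBM cache
    num_tables: int,
    policy: str = "LRU",
) -> int:
    cache_slots = int(hash_size * cache_load_factor)
    total = 8 * num_tables    # cache_hash_size_cumsum (int64), trivial size
    total += 4 * hash_size    # cache_index_table_map (int32)
    total += 8 * cache_slots  # lxu_cache_state (int64)
    # lxu_state: LRU keeps a timestamp per cache slot; LFU keeps a
    # frequency counter per hashed row.
    total += 8 * (cache_slots if policy == "LRU" else hash_size)
    return total
```

As a worked example (illustrative numbers): a single table with 100M rows, dimension 4, fp16 weights, and a 0.2 cache load factor carries roughly 400 MB + 160 MB + 160 MB ≈ 720 MB of auxiliary state under LRU, while the HBM cache itself (`lxu_cache_weights`) is only 20M slots × 4 dims × 2 bytes ≈ 160 MB. This is why the auxiliary state matters most for tables with many rows but few dimensions.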

File tree: 1 file changed (+6, −0 lines)


fbgemm_gpu/fbgemm_gpu/split_table_batched_embeddings_ops_training.py

Lines changed: 6 additions & 0 deletions
```diff
@@ -2286,6 +2286,7 @@ def _apply_cache_state(
         )
 
         self.total_cache_hash_size = cache_state.total_cache_hash_size
+        # 8x of # tables, trivial size
         self.register_buffer(
             "cache_hash_size_cumsum",
             torch.tensor(
@@ -2294,6 +2295,7 @@ def _apply_cache_state(
                 dtype=torch.int64,
             ),
         )
+        # 4x total embedding hash size with uvm cache
         self.register_buffer(
             "cache_index_table_map",
             torch.tensor(
@@ -2302,12 +2304,14 @@ def _apply_cache_state(
                 dtype=torch.int32,
             ),
         )
+        # 8x of total cache slots (embedding hash size * clf)
         self.register_buffer(
             "lxu_cache_state",
             torch.zeros(
                 cache_sets, DEFAULT_ASSOC, device=self.current_device, dtype=torch.int64
             ).fill_(-1),
         )
+        # Cache itself, not auxiliary size
         self.register_buffer(
             "lxu_cache_weights",
             torch.zeros(
@@ -2317,6 +2321,8 @@ def _apply_cache_state(
                 dtype=dtype,
             ),
         )
+        # LRU: 8x of total cache slots (embedding hash size * clf)
+        # LFU: 8x of total embedding hash size with uvm cache
         self.register_buffer(
             "lxu_state",
             torch.zeros(
```
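One way to relate the comments above to actual HBM bytes at runtime is to sum the footprints of the registered buffers. A sketch, assuming the buffers are registered directly on the TBE module as in the diff; `tbe` stands for a hypothetical module instance, and `buffer_bytes` is a helper introduced here:

```python
import torch

def buffer_bytes(module: torch.nn.Module, names: list[str]) -> int:
    # Sum the footprint of the named buffers registered on `module`.
    total = 0
    for name in names:
        buf = module.get_buffer(name)
        total += buf.numel() * buf.element_size()
    return total

# Buffer names touched by this diff, with the sizes its comments assign them.
aux_names = [
    "cache_hash_size_cumsum",  # 8x of # tables, trivial
    "cache_index_table_map",   # 4x total embedding hash size
    "lxu_cache_state",         # 8x of total cache slots
    "lxu_state",               # LRU: 8x cache slots; LFU: 8x hash size
]
# aux_bytes = buffer_bytes(tbe, aux_names)
# cache_bytes = buffer_bytes(tbe, ["lxu_cache_weights"])  # the cache itself
```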
