Commit 9c0856e
3/n Support for multiple shards per table (#2877)
Summary:
Pull Request resolved: #2877
This will be crucial for any non-TW sharding type going through dynamic sharding. It handles the case where the embedding dimension is not the same across shards.
For example:
```
num_embeddings = 4
embedding_dim = 16
Table 0: CW sharded across ranks: [0, 1]
Table 1: CW sharded across rank: [0]
Table 0 shard 0 size: [8, 4]
Table 1 shard 0 size: [16, 4]
```
This will require `Table 0 shard 0` and `Table 1 shard 0` to be concatenated along dimension 1, as sketched below.
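As a rough illustration (shapes taken from the example above; variable names are hypothetical), the two shards only line up along the `num_embeddings` axis, so one reading, given the transpose described under the main changes below, is to transpose each shard and then concatenate along dimension 1:

```python
import torch

num_embeddings = 4
# Shard sizes from the example: [sharded slice of embedding_dim, num_embeddings]
table_0_shard_0 = torch.randn(8, num_embeddings)
table_1_shard_0 = torch.randn(16, num_embeddings)

# Dim 0 differs (8 vs. 16), so after transposing each shard to
# [num_embeddings, slice], the shards concatenate cleanly along dim 1:
combined = torch.cat([table_0_shard_0.T, table_1_shard_0.T], dim=1)
assert combined.shape == (num_embeddings, 24)
```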
Main changes:
## `all_to_all` collective input/output tensor composition & processing
1. Concatenate the `local_input_tensor` for the `all_to_all` collective along dimension 1 instead of dimension 0. Dim 0 varies from shard to shard, since it is each shard's slice of the embedding dimension, while dim 1 is consistently the same across all shards/tables, as it is the number of embeddings.
2. This means both the `local_input_tensor` and the `local_output_tensor` must be **transposed** and properly processed before being passed into the `all_to_all` collective.
3. Made a small optimization to the `local_output_tensor`: instead of updating it repeatedly via `torch.concat`, allocate it once as an empty tensor, since only its final dimensions are needed (see the sketch after this list).
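A minimal sketch of steps 1–3, assuming each shard has already been brought to `[num_embeddings, shard_embedding_dim]` layout and that `dist.all_to_all_single` (which splits along dim 0) is the underlying collective. The function name, arguments, and split bookkeeping here are hypothetical, not the actual `shards_all_to_all` signature:

```python
import torch
import torch.distributed as dist

def exchange_cw_shards(
    local_shards: list[torch.Tensor],  # each [num_embeddings, shard_dim]
    input_splits: list[int],           # rows sent to each rank (post-transpose)
    output_splits: list[int],          # rows received from each rank
    num_embeddings: int,
) -> torch.Tensor:
    # (1) Concatenate along dim 1: dim 0 (num_embeddings) matches for every
    # shard, while dim 1 (each shard's slice of embedding_dim) varies.
    stacked = torch.cat(local_shards, dim=1)

    # (2) Transpose so the variable dimension becomes dim 0, which is the
    # dimension all_to_all_single splits on; make it contiguous for the
    # collective.
    local_input_tensor = stacked.T.contiguous()

    # (3) Allocate the output once from its known final size instead of
    # growing it with repeated torch.concat calls.
    local_output_tensor = torch.empty(
        (sum(output_splits), num_embeddings),
        dtype=local_input_tensor.dtype,
        device=local_input_tensor.device,
    )

    # Requires an initialized process group (e.g. NCCL or Gloo).
    dist.all_to_all_single(
        local_output_tensor,
        local_input_tensor,
        output_split_sizes=output_splits,
        input_split_sizes=input_splits,
    )
    return local_output_tensor
```

Preallocating via `torch.empty` avoids the repeated reallocation and copying that incrementally growing the output with `torch.concat` would incur.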
## Correct Order of `all_to_all` tensor output
To handle multiple shards per table, we need to properly store the **order** in which the `all_to_all` collective gathers the tensors across ranks. The shards composing the `local_output_tensor` are ordered:
1. First by rank.
2. Then by table order in the EBC, which can be inferred from the `module_sharding_plan`.
3. Finally by the shard order within each table.
* Since we can assume each rank contains only one shard per table, we only need to track 1. and 2. The return type of `shards_all_to_all` and the input type of `update_state_dict_post_resharding` are updated to a flattened list in the above order.
* I'm also storing the `shard_size` in dim 1 for this output while composing the `local_output_tensor`, to avoid needing to re-query it in `update_state_dict_post_resharding` (see the sketch below).
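A hedged sketch of that flattened ordering (the helper, its arguments, and the tuple layout are hypothetical; the real return type of `shards_all_to_all` may differ). Entries are rank-major, then follow the table order inferred from the `module_sharding_plan`, with each entry carrying the shard's dim-1 size:

```python
from typing import Dict, List, Tuple

def flattened_output_order(
    ranks: List[int],
    tables_by_rank: Dict[int, List[str]],  # tables each rank sends, in EBC order
    shard_dim1_size: Dict[str, int],       # per-table shard size along dim 1
) -> List[Tuple[str, int]]:
    """Order entries (1) by rank, (2) by table order from the sharding plan.

    Each rank is assumed to hold only one shard per table, so no per-table
    shard index is needed. The dim-1 size is stored alongside the table name
    so update_state_dict_post_resharding does not need to re-query it.
    """
    order: List[Tuple[str, int]] = []
    for rank in sorted(ranks):
        for table in tables_by_rank[rank]:
            order.append((table, shard_dim1_size[table]))
    return order
```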
This will ensure correct behavior in the CW sharding implementation/test in the next diff.
Reviewed By: iamzainhuda
Differential Revision: D72486367
fbshipit-source-id: c444434b129f8ab9bf678a7f58a77cf99063b6451
File tree: `torchrec/distributed/sharding`, 2 files changed (+61, -28 lines).