@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
2929adds the distributed data sampler to the dataloader to shard the dataset.
3030
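For reference, the sampler insertion described above is roughly what the following plain-PyTorch sketch does by hand. The toy dataset and the hard-coded `num_replicas=4, rank=0` are illustrative assumptions; Colossal-AI performs this wiring automatically based on the configured data parallel size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for a real training set (illustrative only).
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# Each of the `num_replicas` data-parallel ranks receives a disjoint shard of the dataset.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # keeps per-epoch shuffling consistent across ranks
    for inputs, labels in loader:
        pass  # forward/backward pass would go here
```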
3131
32+ ## 1D, 2D, 2.5D and 3D Parallel
33+ To enable hybrid parallelism, we provide an array of tensor parallelism methods. The paper corresponding to each
34+ tensor parallel method is listed below. These parallel modes must be used together with the distributed layers provided by Colossal-AI.
35+ - 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
36+
37+ - 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
38+ 2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
39+ model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
40+ devices, where $N$ is the number of tensor chunks in a single dimension.
41+
42+ - 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
43+ Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
44+ parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers,
45+ where each layer performs matrix multiplication operations independently with a dimension $N$.
46+
47+ - 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
48+ We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
49+ the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
50+ through optimized load balancing of parameters as well as activations.
51+
52+
53+
54+ ```python
55+ # 1D parallel
56+ parallel = dict(
57+     pipeline=dict(size=1),  # number of pipeline stages
58+     tensor=dict(size=4, mode='1d')
59+ )
60+
61+ # 2D parallel
62+ parallel = dict(
63+     pipeline=dict(size=1),  # number of pipeline stages
64+     tensor=dict(size=4, mode='2d')
65+ )
66+
67+ # 2.5D parallel
68+ parallel = dict(
69+     pipeline=dict(size=1),  # number of pipeline stages
70+     tensor=dict(size=8, mode='2.5d', depth=2)
71+ )
72+
73+ # 3D parallel
74+ parallel = dict(
75+     pipeline=dict(size=1),  # number of pipeline stages
76+     tensor=dict(size=8, mode='3d')
77+ )
78+ ```
79+
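To make the device-count arithmetic above concrete, here is a small, purely illustrative helper (`mesh_coordinates` is not part of Colossal-AI's API) that maps a flat rank onto the $N \times N$ mesh, the $d$-layer $N \times N$ arrangement, and the $N \times N \times N$ cube implied by each mode.

```python
import math

def mesh_coordinates(rank, world_size, mode, depth=1):
    """Map a flat rank onto the tensor-parallel device mesh (illustrative only)."""
    if mode == '1d':
        return (rank,)
    if mode == '2d':
        n = math.isqrt(world_size)
        assert n * n == world_size, "2D mode needs N^2 devices"
        return (rank // n, rank % n)               # (row, col) on the N x N mesh
    if mode == '2.5d':
        n = math.isqrt(world_size // depth)
        assert n * n * depth == world_size, "2.5D mode needs N^2 * depth devices"
        layer, local = divmod(rank, n * n)
        return (layer, local // n, local % n)      # (layer, row, col)
    if mode == '3d':
        n = round(world_size ** (1 / 3))
        assert n ** 3 == world_size, "3D mode needs N^3 devices"
        return (rank // (n * n), rank % (n * n) // n, rank % n)  # (i, j, k) on the cube
    raise ValueError(f"unknown tensor parallel mode: {mode}")

# With tensor=dict(size=8, mode='2.5d', depth=2), rank 5 lands on layer 1, row 0, col 1.
print(mesh_coordinates(5, 8, '2.5d', depth=2))  # (1, 0, 1)
```

Only tensor sizes that satisfy these constraints are valid, which is why the example configurations above use 4 for `'2d'` and 8 for `'2.5d'` (with `depth=2`) and `'3d'`.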
3280## Pipeline Parallel (experimental)
3381
3482Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
@@ -160,57 +208,9 @@ schedule = dict(
160208)
161209```
162210
163- ## 1D, 2D, 2.5D and 3D Parallel
164- To enable hybrid parallelism, we provide an array of tensor parallelism methods. The paper corresponding to each
165- tensor parallel method is listed below. These parallel modes must be used together with the distributed layers provided by Colossal-AI.
166- - 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
167-
168- - 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
169- 2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
170- model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
171- devices, where $N$ is the number of tensor chunks in a single dimension.
172-
173- - 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
174- Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
175- parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers,
176- where each layer performs matrix multiplication operations independently with a dimension $N$.
177-
178- - 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
179- We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
180- the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
181- through optimized load balancing of parameters as well as activations.
182-
183-
184-
185- ```python
186- # 1D parallel
187- parallel = dict(
188-     pipeline=dict(size=1),  # number of pipeline stages
189-     tensor=dict(size=4, mode='1d')
190- )
191-
192- # 2D parallel
193- parallel = dict(
194-     pipeline=dict(size=1),  # number of pipeline stages
195-     tensor=dict(size=4, mode='2d')
196- )
197-
198- # 2.5D parallel
199- parallel = dict(
200-     pipeline=dict(size=1),  # number of pipeline stages
201-     tensor=dict(size=8, mode='2.5d', depth=2)
202- )
203-
204- # 3D parallel
205- parallel = dict(
206-     pipeline=dict(size=1),  # number of pipeline stages
207-     tensor=dict(size=8, mode='3d')
208- )
209- ```
210-
211211
212212## Sequence Parallel (experimental)
213213
214214Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
215215This method is proposed in [ Sequence Parallelism: Making 4D Parallelism Possible] ( https://arxiv.org/abs/2105.13120 ) .
216- This feature is still in development and is only experimental for now.
216+ This feature is still in development and is only experimental for now.
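As a rough intuition only (this is not Colossal-AI's implementation, which is described in the paper above), the sketch below shows the basic idea of sharding a long sequence along its sequence dimension so that each rank holds one contiguous chunk of tokens.

```python
def sequence_chunk(tokens, rank, world_size):
    """Return the contiguous slice of the sequence owned by `rank` (illustrative only).

    Real sequence parallelism also requires communication between ranks
    so that attention can be computed across chunk boundaries.
    """
    chunk = (len(tokens) + world_size - 1) // world_size  # ceiling division
    start = rank * chunk
    return tokens[start:start + chunk]

# A 12-token "document" split across 4 ranks: each rank keeps 3 tokens.
document = list(range(12))
print([sequence_chunk(document, r, 4) for r in range(4)])
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```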