docs/parallelization.md
not have to explicitly set them in your configurations. When data parallel size is larger than 1, Colossal-AI automatically
adds the distributed data sampler to the dataloader to shard the dataset.
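
To illustrate what this sharding means, the snippet below uses a plain PyTorch `DistributedSampler` directly. It is a minimal sketch of the behaviour, not Colossal-AI's internal code, and the rank and world size are fixed by hand for demonstration.

```python
# Plain-PyTorch sketch of what a distributed data sampler does (an
# illustration only, not Colossal-AI's internal code): each rank iterates
# over a disjoint shard of the dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16).float())

# DistributedSampler normally reads rank/world size from torch.distributed;
# here we pass them explicitly to show the shard seen by rank 0 of 4.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for batch in loader:
    print(batch)  # rank 0 only sees its 4-sample shard of the 16 samples
```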


## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods and list the paper that each
method is based on. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)

- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
model weights, and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
devices, where $N$ is the number of tensor chunks in a single dimension (see the sketch after this list).

- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers,
where each layer performs matrix multiplication operations independently with dimension $N$.

- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
through optimized load balancing of parameters as well as activations.
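
As a concrete illustration of the 2D chunking described above, the snippet below splits a weight matrix over a 2 × 2 device mesh. It is a sketch for intuition only, not Colossal-AI's SUMMA implementation.

```python
# Sketch for intuition only (not Colossal-AI's SUMMA implementation):
# chunk a 4 x 4 weight matrix over a 2 x 2 device mesh, i.e. N = 2 and
# P = N^2 = 4 devices, each holding one 2 x 2 tile.
import torch

N = 2
weight = torch.arange(16.0).reshape(4, 4)

# Split rows into N row-blocks, then each row-block into N column-blocks;
# tile (i, j) would live on device (i, j) of the mesh.
tiles = [list(rows.chunk(N, dim=1)) for rows in weight.chunk(N, dim=0)]
for i in range(N):
    for j in range(N):
        print(f"device ({i}, {j}):\n{tiles[i][j]}")
```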



```python
# 1D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=4, mode='1d')  # 4 devices in a single tensor parallel group
)

# 2D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=4, mode='2d')  # P = N^2 = 4 devices in a 2 x 2 mesh (N = 2)
)

# 2.5D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=8, mode='2.5d', depth=2)  # P = N^2 * d = 8 devices (N = 2, d = 2)
)

# 3D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=8, mode='3d')  # P = N^3 = 8 devices in a 2 x 2 x 2 cube (N = 2)
)
```
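
Note that the tensor parallel `size` must be consistent with the chosen mode: any size for 1D, $N^2$ for 2D, $N^2 \times d$ for 2.5D, and $N^3$ for 3D. The helper below is a minimal sketch, not part of Colossal-AI, that checks this relationship for the configurations above.

```python
# Minimal sketch (not part of Colossal-AI): check that a tensor parallel size
# is compatible with the chosen mode, following the formulas above.
import math

def check_tensor_parallel_size(size: int, mode: str, depth: int = 1) -> None:
    if mode == '1d':
        ok = size >= 1                                # any number of devices forms a 1D group
    elif mode == '2d':
        n = math.isqrt(size)
        ok = n * n == size                            # P = N^2 devices in an N x N mesh
    elif mode == '2.5d':
        n = math.isqrt(size // depth)
        ok = depth >= 1 and n * n * depth == size     # P = N^2 * d devices in d layers
    elif mode == '3d':
        n = round(size ** (1 / 3))
        ok = n ** 3 == size                           # P = N^3 devices in an N x N x N cube
    else:
        raise ValueError(f"unknown tensor parallel mode: {mode}")
    if not ok:
        raise ValueError(f"size {size} is invalid for mode {mode}")

check_tensor_parallel_size(4, '2d')             # 4 = 2^2
check_tensor_parallel_size(8, '2.5d', depth=2)  # 8 = 2^2 * 2
check_tensor_parallel_size(8, '3d')             # 8 = 2^3
```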

## Pipeline Parallel (experimental)

Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
```python
schedule = dict(
)
```
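
To make the layer-wise split concrete, here is a minimal plain-PyTorch sketch, not Colossal-AI's pipeline engine, that places two partitions of a small model on two devices; the layer sizes and device ids are chosen only for illustration.

```python
# Minimal plain-PyTorch sketch of splitting a model by layer into two pipeline
# stages (an illustration of the idea, not Colossal-AI's pipeline engine).
import torch
import torch.nn as nn

stage0 = nn.Linear(16, 32)   # first partition, e.g. on GPU 0
stage1 = nn.Linear(32, 8)    # second partition, e.g. on GPU 1

if torch.cuda.device_count() >= 2:
    stage0 = stage0.to('cuda:0')
    stage1 = stage1.to('cuda:1')

def forward(x: torch.Tensor) -> torch.Tensor:
    # In a real pipeline, the batch is split into micro-batches so the two
    # stages can work on different micro-batches at the same time.
    h = stage0(x.to(stage0.weight.device))
    return stage1(h.to(stage1.weight.device))

out = forward(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])
```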


## Sequence Parallel (experimental)

Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.
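
For experimentation, a configuration sketch following the same pattern as the other tensor parallel modes is shown below; the mode string `'sequence'` and the size value are assumptions and may change while the feature is under development.

```python
# Configuration sketch only: sequence parallelism declared like the other
# tensor parallel modes. The mode string 'sequence' and the size value are
# assumptions and may change while this feature is experimental.
parallel = dict(
    pipeline=dict(size=1),                # number of pipeline stages
    tensor=dict(size=4, mode='sequence')  # split activations along the sequence dimension
)
```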