hpcaitech
diff --git a/README.md b/README.md
Lines changed: 15 additions & 4 deletions
diff --git a/docs/add_your_parallel.md b/docs/add_your_parallel.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/add_your_parallel_zh.md b/docs/add_your_parallel_zh.md
Lines changed: 7 additions & 16 deletions
diff --git a/docs/amp.md b/docs/amp.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/amp_zh.md b/docs/amp_zh.md
Lines changed: 6 additions & 11 deletions
diff --git a/docs/conf.py b/docs/conf.py
Lines changed: 1 addition & 1 deletion
diff --git a/docs/config.md b/docs/config.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/config_zh.md b/docs/config_zh.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/index.rst b/docs/index.rst
Lines changed: 2 additions & 2 deletions
diff --git a/docs/index_en.rst b/docs/index_en.rst
Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,10 @@
-# ColossalAI
+# Colossal-AI
 
 An integrated large-scale model training system with efficient parallelization techniques.
 
-arXiv: [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
+Paper: [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
+
+Blog: [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://www.hpcaitech.com/blog)
 
 ## Installation
 
@@ -91,16 +93,25 @@ class MLP_2D(nn.Module):
 
 ## Features
 
-ColossalAI provides a collection of parallel training components for you. We aim to support you to write your
+Colossal-AI provides a collection of parallel training components for you. We aim to support you to write your
 distributed deep learning models just like how you write your single-GPU model. We provide friendly tools to kickstart
 distributed training in a few lines.
 
 - [Data Parallelism](./docs/parallelization.md)
 - [Pipeline Parallelism](./docs/parallelization.md)
 - [1D, 2D, 2.5D, 3D and sequence parallelism](./docs/parallelization.md)
-- [friendly trainer and engine](./docs/trainer_engine.md)
+- [Friendly trainer and engine](./docs/trainer_engine.md)
 - [Extensible for new parallelism](./docs/add_your_parallel.md)
 - [Mixed Precision Training](./docs/amp.md)
 - [Zero Redundancy Optimizer (ZeRO)](./docs/zero.md)
 
+## Cite Us
 
+```
+@article{bian2021colossal,
+  title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},
+  author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang},
+  journal={arXiv preprint arXiv:2110.14883},
+  year={2021}
+}
+```
@@ -2,7 +2,7 @@
 
 ## Overview
 
-To enable researchers and engineers to extend our framework to other novel large-scale distributed training algorithm
+To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
 with less effort, we have decoupled various components in the training lifecycle. You can implement your own
 parallelism by simply inheriting from the base class.
 
@@ -15,7 +15,7 @@ The main components are:
 ## Process Group Initializer
 
 Parallelism is often managed by process groups where processes involved in the same parallel algorithm are placed in the same
-process group. For different parallel algorithms, different process groups need to be created. ColossalAI provides a
+process group. For different parallel algorithms, different process groups need to be created. Colossal-AI provides a
 global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
 define a new class and set it in your configuration file. To define your own way of creating process groups, you can
 follow the steps below to create a new distributed initialization.
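
The registration flow described in this hunk can be sketched without importing Colossal-AI. The names `ParallelMode`, `ProcessGroupInitializer`, `DIST_GROUP_INITIALIZER.register_module`, `MyParallelInitializer` and `INITIALIZER_MAPPING` appear in the docs touched by this diff; everything else here is an assumption made for illustration: the `Registry` class, `get_module`, `build_initializer`, the `init_dist_group` method name, and the reduced constructor arguments (the docs say the real initializer receives six arguments from `ParallelContext`).

```python
from enum import Enum


class ParallelMode(Enum):
    # Stand-in for colossalai.context.parallel_mode.ParallelMode.
    DATA = 'data'
    NEW_MODE = 'new_mode'  # the mode being added


class Registry:
    """Hypothetical minimal registry that maps class names to classes."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, cls):
        # Used as a bare decorator, mirroring @DIST_GROUP_INITIALIZER.register_module.
        self._modules[cls.__name__] = cls
        return cls

    def get_module(self, name):
        return self._modules[name]


DIST_GROUP_INITIALIZER = Registry('dist_group_initializer')
INITIALIZER_MAPPING = {}  # stand-in for colossalai.constants.INITIALIZER_MAPPING


class ProcessGroupInitializer:
    """Stand-in base class with a reduced, assumed argument list."""

    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size

    def init_dist_group(self):  # method name is an assumption
        raise NotImplementedError


@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):
    def __init__(self, rank, world_size, group_size):
        super().__init__(rank, world_size)
        self.group_size = group_size

    def init_dist_group(self):
        # Split ranks into contiguous groups and return the group holding this rank.
        start = (self.rank // self.group_size) * self.group_size
        return list(range(start, min(start + self.group_size, self.world_size)))


# Map the new mode's name to the registered initializer class name.
INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'


def build_initializer(mode, **kwargs):
    # Hypothetical helper: look up the class by mode, then instantiate it.
    cls = DIST_GROUP_INITIALIZER.get_module(INITIALIZER_MAPPING[mode.value])
    return cls(**kwargs)


initializer = build_initializer(ParallelMode.NEW_MODE, rank=5, world_size=8, group_size=4)
group_ranks = initializer.init_dist_group()
```

Looking classes up by name keeps the configuration file free of imports: the config only has to carry a string, and the registry resolves it at start-up.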
@@ -110,7 +110,7 @@ dist_initializer = [
 
 ## Schedule
 
-Schedule entails how to execute a forward and backward pass. Currently, ColossalAI provides pipeline and non-pipeline
+Schedule entails how to execute a forward and backward pass. Currently, Colossal-AI provides pipeline and non-pipeline
 schedules. If you want to modify how the forward and backward passes are executed, you can
 inherit `colossalai.engine.BaseSchedule` and implement your idea. You can also add your schedule to the engine before
 training.
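
As a rough illustration of what a custom schedule controls, the sketch below runs the forward and backward pass over micro-batches and accumulates gradients. It does not use the real `colossalai.engine.BaseSchedule`: the base class here is a stand-in, and the method name `forward_backward_step`, the `ToyLinear` model and the squared-error loss are all assumptions chosen to keep the example self-contained.

```python
class BaseSchedule:
    """Stand-in for colossalai.engine.BaseSchedule; the method name is hypothetical."""

    def forward_backward_step(self, model, batch):
        raise NotImplementedError


class ToyLinear:
    """y = w * x with a manually accumulated gradient (stand-in for a real model)."""

    def __init__(self, w):
        self.w = w
        self.grad = 0.0

    def forward(self, x):
        return self.w * x

    def backward(self, x, grad_output):
        # dL/dw = dL/dy * dy/dw = grad_output * x
        self.grad += grad_output * x


class GradAccumulationSchedule(BaseSchedule):
    """Non-pipeline schedule that processes one batch as several micro-batches."""

    def __init__(self, num_micro_batches):
        self.num_micro_batches = num_micro_batches

    def forward_backward_step(self, model, batch):
        inputs, targets = batch
        n = len(inputs)  # assumed to be divisible by num_micro_batches
        size = n // self.num_micro_batches
        total_loss = 0.0
        for start in range(0, size * self.num_micro_batches, size):
            micro = zip(inputs[start:start + size], targets[start:start + size])
            for x, t in micro:
                y = model.forward(x)                # forward pass
                total_loss += (y - t) ** 2 / n      # mean squared error over the full batch
                model.backward(x, 2 * (y - t) / n)  # backward pass, gradients accumulate
        return total_loss


model = ToyLinear(w=2.0)
schedule = GradAccumulationSchedule(num_micro_batches=2)
loss = schedule.forward_backward_step(model, ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 10.0]))
```

Because the loss and the gradients are both divided by the full batch size, the accumulated result matches what a single pass over the whole batch would produce.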
@@ -1,7 +1,6 @@
 # 添加新的并行技术
 
-为了方便科研人员和工程师们更方便地拓展我们的框架来兼容一些新的大规模分布式训练算法，我们对训练过程中的几个组件进行了解耦，您可以通过继承基类的方式
-来实现新的并行技术。
+为了方便科研人员和工程师们更方便地拓展我们的系统来兼容一些新的大规模分布式训练算法，我们对训练过程中的几个组件进行了解耦，您可以通过继承基类的方式来实现新的并行技术。
 
 主要的组件如下所示：
 
@@ -11,9 +10,7 @@
 
 ## 进程组初始化器
 
-并行化一般是通过进程组来进行管理的，同属于一个并行化算法的进程将被分到一个进程组中，如果系统中存在多种不同的并行化技术，那么需要创建多个不同的进程组。
-ColossalAI为用户提供了一个全局上下文变量来便捷地管理他们的进程组。如果您希望增加新的进程组，您可以定义一个新的类并且在您的配置文件中进行设置。下方的
-代码块中介绍了如果在系统中加入您的新并行技术以及如何进行初始化。
+并行化一般是通过进程组来进行管理的，同属于一个并行化算法的进程将被分到一个进程组中，如果系统中存在多种不同的并行化技术，那么需要创建多个不同的进程组。Colossal-AI为用户提供了一个全局上下文变量来便捷地管理他们的进程组。如果您希望增加新的进程组，您可以定义一个新的类并且在您的配置文件中进行设置。下方的代码块介绍了如何在系统中加入您的新颖并行技术以及如何进行初始化。
 
 1. 在`colossalai.context.parallel_mode.ParallelMode`中添加新的并行模式。
 ```python
@@ -28,9 +25,7 @@ class ParallelMode(Enum):
     NEW_MODE = 'new_mode'  # define your mode here
 ```
 
-2. 创建一个`ProcessGroupInitializer`的子类，您可以参考`colossalai.context.dist_group_initializer`中给出的例子。前六个参数将由`ParallelContext`
-决定。如果您需要设置新的参数，您可以用新的参数替换下面例子中的`arg1`与`arg2`。最后，您需要使用`@DIST_GROUP_INITIALIZER.register_module`装饰器
-在我们的注册表注册您的初始化器。
+2. 创建一个`ProcessGroupInitializer`的子类，您可以参考`colossalai.context.dist_group_initializer`中给出的例子。前六个参数将由`ParallelContext`决定。如果您需要设置新的参数，您可以用新的参数替换下面例子中的`arg1`与`arg2`。最后，您需要使用`@DIST_GROUP_INITIALIZER.register_module`装饰器在我们的注册表中注册您的初始化器。
 ```python
 # sample initializer class
 @DIST_GROUP_INITIALIZER.register_module
@@ -55,14 +50,13 @@ class MyParallelInitializer(ProcessGroupInitializer):
         pass
 ```
 
-在此之后，您可以将您的初始化器插入到当前的mode-to-initialize映射`colossalai.constants.INITIALIZER_MAPPING`中，您也可以通过更改该文件来动态变更名称与
-并行模式的映射。
+在此之后，您可以将您的初始化器插入到当前的mode-to-initialize映射`colossalai.constants.INITIALIZER_MAPPING`中，您也可以通过更改该文件来动态变更名称与并行模式的映射。
 
 ```python
 colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
 ```
 
-3. 在配置文件中设置您的初始化器，如果您的初始化器需要参数，您可以自行传入，下面的代码可以让`ParallelContext`来创建您的初始化器并初始化您需要的进程组。
+3. 在配置文件中设置您的初始化器。如果您的初始化器需要参数，您可以自行传入。下面的代码可以让`ParallelContext`来创建您的初始化器并初始化您需要的进程组。
 
 ```python
 parallel = dict(
@@ -73,9 +67,7 @@ parallel = dict(
 
 ## 梯度处理器
 
-梯度处理器的功能是对模型参数的梯度进行all-reduce操作。由于不同的并行技术可能需要不同的all-reduce操作，用户们可以通过继承
-`colossalai.engine.gradient_handler.BaseGradientHandler`来执行其个性化操作。目前，ColossalAI使用普通的数据并行梯度处理器，该处理器在所有的数据
-并行rank上执行all-reduce操作，且当ColossalAI监测到当前系统使用了数据并行时，该处理器会被自动创建。您可以使用下方代码块中的代码添加您自定义的梯度处理器：
+梯度处理器的功能是对模型参数的梯度进行all-reduce操作。由于不同的并行技术可能需要不同的all-reduce操作，用户们可以通过继承`colossalai.engine.gradient_handler.BaseGradientHandler`来执行其个性化操作。目前，Colossal-AI使用普通的数据并行梯度处理器，该处理器在所有的数据并行rank上执行all-reduce操作，且当Colossal-AI检测到当前系统使用了数据并行时，该处理器会被自动创建。您可以使用下方代码块中的代码添加您自定义的梯度处理器：
 
 ```python
 from colossalai.registry import GRADIENT_HANDLER
@@ -99,5 +91,4 @@ dist_initializer = [
 
 ## 调度器
 
-调度器中指定了在前向传播和后向传播时需要执行哪些操作，ColossalAI提供了支持流水线和不支持流水线的调度器。如果您想要修改前向传播和后向传播的执行方式，您可以
-继承`colossalai.engine.BaseSchedule`并实现您想要的操作。您也可以在训练模型之前将您的调度器添加到我们的引擎中来。
+调度器中指定了在前向传播和后向传播时需要执行哪些操作，Colossal-AI提供了流水线和非流水线的调度器。如果您想要修改前向传播和后向传播的执行方式，您可以继承`colossalai.engine.BaseSchedule`并实现您想要的操作。您也可以在训练模型之前将您的调度器添加到我们的引擎中来。
@@ -1,6 +1,6 @@
 # Mixed precision training
 
-In ColossalAI, we have incorporated different implementations of mixed precision training:
+In Colossal-AI, we have incorporated different implementations of mixed precision training:
 1. torch.cuda.amp
 2. apex.amp
 3. tensor-parallel amp
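
All three options run part of the computation in `fp16`, which is why mixed precision training is usually paired with loss scaling and an `inf`/`nan` check on the gradients. The sketch below illustrates that idea with the standard library only. It is a conceptual illustration, not the API of `torch.cuda.amp`, `apex.amp` or the tensor-parallel implementation; the helper names and the halving back-off rule are assumptions.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except OverflowError:
        # struct refuses finite values beyond the fp16 range; treat them as overflow.
        return math.copysign(math.inf, x)


def has_inf_or_nan(grads):
    return any(math.isinf(g) or math.isnan(g) for g in grads)


def scaled_fp16_grads(true_grads, scale):
    """Simulate gradients computed in fp16 from a loss multiplied by `scale`."""
    return [to_fp16(g * scale) for g in true_grads]


def unscale_or_skip(fp16_grads, scale):
    """Return (grads, new_scale); grads is None when the step must be skipped."""
    if has_inf_or_nan(fp16_grads):
        return None, scale / 2  # back off and retry later with a smaller scale
    return [g / scale for g in fp16_grads], scale


tiny = 1e-8
underflowed = to_fp16(tiny)  # a small gradient is flushed to zero without scaling

# With scaling, the same gradient survives the trip through fp16.
grads, scale = unscale_or_skip(scaled_fp16_grads([tiny], 1024.0), 1024.0)

# A large gradient overflows after scaling, so the step is skipped and the scale is reduced.
skipped, smaller_scale = unscale_or_skip(scaled_fp16_grads([100.0], 1024.0), 1024.0)
```

Under tensor parallelism each process holds only a slice of the parameters, so the `has_inf_or_nan` result would have to be agreed on across processes before deciding whether to skip a step. That coordination is not shown here.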
 
@@ -1,21 +1,17 @@
 # 混合精度训练
 
-ColossalAI可以使用如下三种不同的混合精度训练方式：
+Colossal-AI可以使用如下三种不同的混合精度训练方式：
 1. torch.cuda.amp
 2. apex.amp
 3. 张量并行AMP
 
-前两种混合精度训练方式依赖于[PyTorch](https://pytorch.org/docs/stable/amp.html)的原生实现（1.6或以上版本）以及
-[Nvidia Apex](https://github.com/NVIDIA/apex)，但这两种方法与张量并行并不兼容，因为在张量并行中我们需要将张量进行切分并保存在不同的设备上，
-因此，实现兼容张量并行的混合精度训练需要在不同进程之间不断通信来交流`inf`以及`nan`是否存在于模型参数中，因此我们才用了
-[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)的实现方式。
+前两种混合精度训练方式依赖于[PyTorch](https://pytorch.org/docs/stable/amp.html)的原生实现（1.6或以上版本）以及[Nvidia Apex](https://github.com/NVIDIA/apex)，但这两种方法与张量并行并不兼容，因为在张量并行中我们需要将张量进行切分并保存在不同的设备上，因此，实现兼容张量并行的混合精度训练需要在不同进程之间不断通信来交流`inf`以及`nan`是否存在于模型参数中，因此我们采用了[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)的实现方式。
 
-您可以简单地将配置文件中的`fp16`字段设置为True来使用混合精度训练。目前，PyTorch与Apex的amp不能保证与张量和流水线并行兼容，因此，我们推荐您使用
-最后一种混合精度训练方式。
+您可以简单地将配置文件中的`fp16`字段设置为True来使用混合精度训练。目前，PyTorch与Apex的amp不能保证与张量和流水线并行兼容，因此，我们推荐您使用最后一种混合精度训练方式。
 
 ## PyTorch AMP
 
-PyTorch在1.6及以上版本中提供了混合精度训练，其可以在保持一些操作的精度为`fp32`的同时，将数据转换成`fp16`格式，您可以在配置文件中配置使用。
+PyTorch在1.6及以上版本中提供了混合精度训练，它可以在保持一些操作的精度为`fp32`的同时，将数据转换成`fp16`格式，您可以在配置文件中配置使用。
 
 ```python
 from colossalai.engine import AMP_TYPE
@@ -33,8 +29,7 @@ fp16=dict(
 
 ## Apex AMP
 
-我们使用了[Apex](https://nvidia.github.io/apex/)中的混合精度训练，因为该模式提供了细粒度的混合精度控制，例如，`O2`级（第二级优化器）将会保持
-批标准化在`fp32`上进行。下面的代码块展示了使用Apex AMP的配置文件。
+我们使用了[Apex](https://nvidia.github.io/apex/)中的混合精度训练，因为该模式提供了细粒度的混合精度控制，例如，`O2`级（第二级优化器）将会保持批标准化在`fp32`上进行。下面的代码块展示了使用Apex AMP的配置文件。
 
 ```python
 from colossalai.engine import AMP_TYPE
@@ -59,7 +54,7 @@ fp16 = dict(
 
 ## 张量并行AMP
 
-我们借鉴了Megatron-LM的混合精度训练实现，该实现方式与张量并行与流水线并行相兼容。下面的代码块展示了使用张量并行AMP的配置文件。
+我们借鉴了Megatron-LM的混合精度训练实现，该实现方式与张量并行、流水线并行相兼容。下面的代码块展示了使用张量并行AMP的配置文件。
 
 ```python
 from colossalai.engine import AMP_TYPE
 
@@ -17,7 +17,7 @@
 
 # -- Project information -----------------------------------------------------
 
-project = 'ColossalAI'
+project = 'Colossal-AI'
 copyright = '2021, HPC-AI Tech'
 author = 'HPC-AI Technology Inc.'
 
 
@@ -1,6 +1,6 @@
 # Config file
 
-Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using ColossalAI:
+Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using Colossal-AI:
 
 ```python
 # build train_dataset and train_dataloader from this dictionary
 
@@ -1,6 +1,6 @@
 # 配置文件
 
-下方代码块中的示例展示了如何在CIFAR10数据集上使用ColossalAI训练ViT模型。
+下方代码块中的示例展示了如何在CIFAR10数据集上使用Colossal-AI训练ViT模型。
 
 ```python
 # build train_dataset and train_dataloader from this dictionary
 
@@ -1,9 +1,9 @@
-.. ColossalAI documentation master file, created by
+.. Colossal-AI documentation master file, created by
    sphinx-quickstart on Mon Oct 11 17:05:05 2021.
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.
 
-ColossalAI开发文档
+夸父AI系统（Colossal-AI）开发文档
 ======================================
 .. toctree::
    :maxdepth: 1
 
@@ -1,9 +1,9 @@
-.. ColossalAI documentation master file, created by
+.. Colossal-AI documentation master file, created by
    sphinx-quickstart on Mon Oct 11 17:05:05 2021.
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.
 
-ColossalAI documentation
+Colossal-AI documentation
 ======================================
 .. toctree::
    :maxdepth: 1