@@ -90,7 +133,13 @@ DLRover 在重启训练子进程前运行一个简单的 allgather 任务来排
### DLRover Error Log Collection
-在 PyTorch 分布式训练中,一个节点的进程出错后,Torch Elastic 会停止所有节点的进程。各个进程的日志都是单独存在各自日志文件中。为了找到训练失败是哪个进程出错导致的,我们需要搜索所有进程的日志。这个工作对于千卡作业是十分耗时且繁琐的。为此,我们在 ElasticAgent 中开发了错误日志收集供功能。当 ElasticAgent 发现子进程失败后,后将其错误日志的 message 发送给 job master。job master 会在其日志中展示具体哪个节点的那个进程失败了,以及错误日志。这样用户只需看下 job master 的节点日志就可以定位训练失败原因了。同时我们也支持将错误信息上报给钉钉。
+In PyTorch distributed training, once a process on one node fails, Torch Elastic stops the processes on all nodes.
+Each process writes its logs to its own file. To find out which failed process caused the training failure,
+we would have to search the logs of every process, which is time-consuming and tedious for a job with thousands of GPUs.
+To solve this, we developed an error log collection feature in the ElasticAgent.
+When the ElasticAgent finds that a child process has failed, it sends the error message from that process's log to the job master.
+The job master then shows in its own log which process on which node failed, together with the error log,
+so users can locate the cause of a training failure just by checking the job master's log. We also support reporting the error information to DingTalk.
+
```json
任务 torch-train 训练进程失败 torch-train-edljob worker-116 restart 0 fails: {
"784": {
@@ -110,7 +159,9 @@ DLRover 在重启训练子进程前运行一个简单的 allgather 任务来排
## Save/Load Optimization for FSDP Parallelism
-DLRover 弹性容错需要依赖 checkpoint 来恢复模型状态。当前我们的大模型训练采用 FSDP 的并行方式,FSDP 保存 checkpoint 的方案有两种:1. rank0_only :由 RANK-0 节点获取所有的模型参数和优化器状态存入磁盘,2.sharding方式:所有 RANK 各自保存其模型参数和优化器状态。但是这两个方案都没法满足弹性容错训练的需求。
+DLRover's elastic fault tolerance relies on checkpoints to restore the model state. Our current large-model training uses FSDP parallelism,
+and FSDP provides two ways to save a checkpoint: 1. rank0_only: the RANK-0 node gathers all model parameters and optimizer states and writes them to disk;
+2. sharding: every rank saves its own model parameters and optimizer states. Neither option meets the needs of elastic fault-tolerant training.
rank0_only:
- RANK-0 has to load all model parameters and optimizer states, which may cause an OOM.
@@ -122,16 +173,19 @@ sharding 方式:
### Reshard-capable Save/Load for Parameters
-原始 torch save 是将整个参数进行 pickle,load 时整体进行 unpickle,因此内存会出现峰值。为解决该问题,我们在 ATorch 中将 save 的过程拆开,先生成 safetensors 的 meta data,之后按需逐个序列化每个 tensor,再进行写入。
-在保存时,直接保存每个 rank 上的 flat param,同时保存一份对应的 meta 信息。如下图所示,每个 flat param 中保存了多个 meta 信息,每个 meta 信息代表这个 flat param 中原始参数的 shape 和在 flat param 中的 start 和 end,因此在恢复参数时,只需要按照顺序将所有的 param 找出来,拼接到一起后,再进行 reshape 即可获得原始的参数。
+The original torch save pickles the entire set of parameters and load unpickles it as a whole, so memory usage spikes.
+To solve this, ATorch splits the save process: it first generates the safetensors meta data, then serializes each tensor one by one on demand and writes it out.
+When saving, each rank directly saves its own flat param along with the corresponding meta information. As shown in the figure below,
+each flat param carries multiple meta entries, and each entry records the shape of an original parameter and its start and end offsets within the flat param.
+Therefore, to restore a parameter we only need to collect all of its pieces in order, concatenate them, and reshape the result to obtain the original parameter.
-
Logical layout of an FSDP flat param
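+
+Based on this layout, the restore step can be sketched as follows (a minimal sketch, not ATorch's actual implementation; the meta field names `name`, `start`, `end` and `shape` are assumed for illustration):
+
+```python
+import torch
+
+def restore_params(flat_params, metas):
+    """Rebuild original parameters from per-rank flat params.
+
+    flat_params: list of 1-D flat param tensors, one per rank, in rank order.
+    metas: per-rank lists of meta entries, each assumed to look like
+           {"name": str, "start": int, "end": int, "shape": tuple}.
+    """
+    pieces, shapes = {}, {}
+    for flat, rank_meta in zip(flat_params, metas):
+        for m in rank_meta:
+            # slice of the original parameter stored in this rank's flat param
+            pieces.setdefault(m["name"], []).append(flat[m["start"]:m["end"]])
+            shapes[m["name"]] = m["shape"]
+    # concatenate the slices in rank order and reshape back to the original shape
+    return {n: torch.cat(ps).reshape(shapes[n]) for n, ps in pieces.items()}
+```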
Code example:
+
```python
from atorch.utils.fsdp_save_util import save_fsdp_flat_param
model = ... # the model converted to FSDP by atorch
@@ -147,6 +201,7 @@ ckpt
└── flat_param.00001-00002
"""
```
+
```python
+# Passing the ckpt path to init_empty_weights_with_disk_offload initializes the whole
# model on the meta device and loads the ckpt on demand during the FSDP conversion
@@ -157,15 +212,21 @@ with init_empty_weights_with_disk_offload(ckpt_path='ckpt'):
### Reshard-capable Save/Load for Optimizer State
-FSDP 并行训练时,优化器是基于 FSDP 转化后的模型创建的,atorch 会配置 FSDP 的 use_orig_param。这时优化器状态的结构与 flat param 结构相同。如果某些参数不在 flat param 中,则优化器状态获取到的参数为空。同时还保存了优化器状态的 meta 信息,为优化器状态的 param group 信息。
+In FSDP parallel training the optimizer is created from the FSDP-converted model, and atorch enables FSDP's use_orig_param. The optimizer state then has the same structure as the
+flat param. If some parameters are not in the flat param, the optimizer state obtained for them is empty. We also save meta information for the optimizer state, namely its param group information.
Logical layout of the optimizer state with FSDP use_orig_param
-因此在保存的时候,优化器状态也是 flatten 为 1D 的数据。在恢复优化器状态时,使用了 FSDP 提供的 `FSDP.shard_full_optim_state_dict`函数,该函数接收的参数为完整的优化器状态和 FSDP wrap 好的模型来重新切分优化器状态。该函数最终调用 `torch.distributed.fsdp._optim_utils._shard_orig_param_state` 函数来切分状态,并且该函数在 torch 内部只有这一处调用,因此 hook 该函数的实现。
-实际在内部实现时,reshard 根据 FSDP 包好的模型来获取优化器状态的数值区间,该区间在 FSDP 内部为intra_param_start_idx,intra_param_end_idx 参数,含义为新的参数在原始 flatten 权重的取值范围。如下图所示,如果由于修改了 rank/wrap 使得 FSDP 的模型产生了变化,则需要重新切分优化器参数。
+Therefore, when saving, the optimizer state is also flattened into 1-D data. To restore the optimizer state, we use the `FSDP.shard_full_optim_state_dict` function provided by FSDP,
+which takes the full optimizer state and the FSDP-wrapped model and re-shards the optimizer state.
+This function eventually calls `torch.distributed.fsdp._optim_utils._shard_orig_param_state` to shard the state,
+and since this is its only call site inside torch, we hook this function's implementation.
+In the internal implementation, resharding uses the FSDP-wrapped model to obtain the value range of the optimizer state;
+inside FSDP this range is given by the intra_param_start_idx and intra_param_end_idx parameters, which denote the range of the new parameter within the original flattened weights.
+As shown in the figure below, if the FSDP model changes because the rank count or wrapping is modified, the optimizer parameters need to be re-sharded.

@@ -173,6 +234,7 @@ FSDP use_orig_param 的优化器状态的逻辑格式
Diagram of FSDP optimizer state resharding
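+
+The resharding idea can be sketched as follows (illustrative only, not the ATorch implementation; in FSDP the slice boundaries come from the internal intra_param_start_idx and intra_param_end_idx values described above):
+
+```python
+import torch
+
+def reshard_flat_state(full_state_1d, intra_param_start_idx, intra_param_end_idx):
+    """Keep only the slice of a flattened optimizer state (e.g. Adam's exp_avg)
+    that the new FSDP wrapping assigns to this rank."""
+    return full_state_1d[intra_param_start_idx:intra_param_end_idx].clone()
+
+# toy example: after the world size or wrapping changes, a rank keeps a different slice
+full_exp_avg = torch.arange(16.0)          # full flattened state
+local_exp_avg = reshard_flat_state(full_exp_avg, 4, 12)
+```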
Code example:
+
```python
from atorch.utils.fsdp_save_util import save_fsdp_optim_param
# model and optimizer are both objects converted by atorch FSDP
@@ -185,6 +247,7 @@ ckpt
└── optim_param.00001-00002
"""
```
+
```python
from atorch.utils.fsdp_save_util import ShardOptim
sm = ShardOptim("ckpt")
@@ -192,9 +255,10 @@ reshard_optim_state = sm.reshard_optim_state_dict(model)
optimizer.load_state_dict(reshard_optim_state)
```
-## 弹性容错在千亿级大模型训练的应用效果
+## Results of Elastic Fault Tolerance in 100-Billion-Parameter Model Training
-在使用 DLRover 弹性容错之前,Torch 大模型训练只要出错就要重启训练作业。为了及时重启作业,用户写了个程序每隔10min 来检测作业状态。如果失败,就会重启作业。
+Before DLRover elastic fault tolerance was adopted, any error in Torch large-model training required restarting the whole training job. To restart jobs promptly,
+users wrote a program that checked the job status every 10 minutes and restarted the job if it had failed.

@@ -202,7 +266,7 @@ optimizer.load_state_dict(reshard_optim_state)
The table below compares the time spent recovering from a training failure with and without DLRover elastic fault tolerance.
-|
+| | Without elastic fault tolerance | DLRover elastic fault tolerance | |
| --- | --- | --- | --- |
| Training recovery steps | Any failure | Machine hardware failure | Software failure |
@@ -218,6 +282,7 @@ optimizer.load_state_dict(reshard_optim_state)
## Submitting a GPT Elastic Fault-Tolerant Job on Kubernetes
1. Deploy the DLRover ElasticJob CRD on the Kubernetes cluster.
+
```bash
git clone git@github.com:intelligent-machine-learning/dlrover.git
cd dlrover/go/operator/
@@ -225,6 +290,7 @@ make deploy IMG=easydl/elasticjob-controller:master
```
2. Install dlrover[torch] in the Dockerfile that builds the training image.
+
```dockerfile
FROM registry.cn-hangzhou.aliyuncs.com/easydl/dlrover-train:torch201-py38 as base
@@ -236,7 +302,10 @@ COPY ./model_zoo ./model_zoo
```
-3. 在 ElasticJob 的container 的 command里使用 dlrover-run 在运行训练脚本。在镜像 registry.cn-hangzhou.aliyuncs.com/easydl/dlrover-train:nanogpt-test 我们已经准备好了代码和训练数据,可以直接用如下 ElasticJob 来提交示例作业。
+3. Use dlrover-run in the command of the ElasticJob container to run the training script.
+The image registry.cn-hangzhou.aliyuncs.com/easydl/dlrover-train:nanogpt-test
+already contains the code and training data, so you can submit a sample job directly with the following ElasticJob.
+
```yaml
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
@@ -276,6 +345,10 @@ spec:
```
-# 总结 & 未来计划
+## Summary & Future Plans
-DLRover 目前已经在蚂蚁千亿模型训练训练上落地,将GPU故障导致训练暂停时间由 30%降低到了约 12%。我们希望 DLRover 在大规模分布式训练上提供智能化运维功能,降低用户运维成本,提升训练的稳定性。后续我们将介绍蚂蚁在千亿模型训练上的 PyTorch 性能优化方案的扩展包 ATorch,ATorch 旨在提升大规模 GPU 训练的硬件算力效率 HFU (Hardware Flops Utilization) 和训练的稳定性,当前蚂蚁千亿大模型训练使用 Atorch 的 HFU 为 49.6%。我们欢迎不同机构的开发者也能根据自身特点,同我们一起共建 DLRover 项目,推进分布式自动化。
+DLRover has been deployed for Ant Group's 100-billion-parameter model training and has reduced the training downtime caused by GPU failures from 30% to about 12%.
+We want DLRover to provide intelligent operations capabilities for large-scale distributed training, lowering users' operations cost and improving training stability.
+In a follow-up post we will introduce ATorch, Ant Group's PyTorch extension package for performance optimization of 100-billion-parameter model training. ATorch aims to improve the
+Hardware FLOPs Utilization (HFU) and the stability of large-scale GPU training; Ant's current 100-billion-parameter model training achieves an HFU of 49.6% with ATorch.
+We welcome developers from different organizations to build the DLRover project together with us and to advance the automation of distributed training.
diff --git a/docs/deployment/controller.md b/docs/deployment/controller.md
index 478bc0c2a..a3108eb4a 100644
--- a/docs/deployment/controller.md
+++ b/docs/deployment/controller.md
@@ -1,9 +1,13 @@
# Deploy DLRover ElasticJob Controller on a Kubernetes Cluster
-Here, we introduce how to deploy the DLRover job controller directly on a Kubernetes cluster step by step. Minikube is optional and primarily used for testing.
+Here, we introduce how to deploy the DLRover job controller directly on a
+Kubernetes cluster step by step. Minikube is optional and primarily used for testing.
## 1. Preliminary
-- Ensure you have [Kubernetes](https://kubernetes.io/docs/home/) installed. If you prefer to use Minikube for testing purposes, make sure to have [Minikube](https://minikube.sigs.k8s.io/docs/start/) installed and run `minikube start`.
+
+- Ensure you have [Kubernetes](https://kubernetes.io/docs/home/) installed.
+If you prefer to use Minikube for testing purposes, make sure to have [Minikube](https://minikube.sigs.k8s.io/docs/start/)
+installed and run `minikube start`.
## 3. Deploy DLRover ElasticJob Controller With Kubectl
@@ -16,13 +20,14 @@ $ deployment="git@github.com:intelligent-machine-learning/dlrover/dlrover/go/ope
$ kubectl -n dlrover apply -k $deployment
```
-To verify the controller has been deployed, run the command below. The output should show the dlrover-controller-manager pod is running.
+To verify the controller has been deployed, run the command below.
+The output should show the dlrover-controller-manager pod is running.
```bash
kubectl -n dlrover get pods
```
-```
+```bash
NAME READY STATUS RESTARTS AGE
pod/dlrover-controller-manager-7dccdf6c4d-grmks 2/2 Running 0 6m46s
```
@@ -38,10 +43,11 @@ Check traning nodes.
```bash
kubectl -n dlrover get pods
```
-```
+
+```bash
NAME READY STATUS RESTARTS AGE
pod/dlrover-controller-manager-7dccdf6c4d-grmks 2/2 Running 0 4h49m
pod/elasticjob-torch-mnist-dlrover-master 1/1 Running 0 4h42m
pod/torch-mnist-edljob-worker-0 1/1 Running 0 4h42m
pod/torch-mnist-edljob-worker-1 1/1 Running 0 4h42m
-```
\ No newline at end of file
+```
diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md
index a641d2223..88bb515c8 100644
--- a/docs/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -6,23 +6,23 @@ step by step.
## Create namespace
```shell
-$ kubectl create namespace dlrover
+kubectl create namespace dlrover
```
-## Deploy MySQL
+## Deploy MySQL
To create a MySQL DB as the store for DLRover:
```shell
-$ cd dlrover/go/brain/manifests/k8s
-$ kubectl apply -f mysql-pv.yaml
-$ kubectl apply -f mysql.yaml
+cd dlrover/go/brain/manifests/k8s
+kubectl apply -f mysql-pv.yaml
+kubectl apply -f mysql.yaml
```
Create tables in MySQL
```shell
-$ kubectl exec -it mysql-pod-name --namespace dlrover -- bash
-$ cd dlrover
-$ mysql -uroot -proot < dlrover-tables.sql
-```
\ No newline at end of file
+kubectl exec -it mysql-pod-name --namespace dlrover -- bash
+cd dlrover
+mysql -uroot -proot < dlrover-tables.sql
+```
diff --git a/docs/design/db-design.md b/docs/design/db-design.md
index 84e844976..2b843afed 100644
--- a/docs/design/db-design.md
+++ b/docs/design/db-design.md
@@ -49,4 +49,4 @@ create table cluster(
customized_data mediumtext, // cluster customized data
PRIMARY KEY (uid)
)
-```
\ No newline at end of file
+```
diff --git a/docs/design/dlrover-overview.md b/docs/design/dlrover-overview.md
index 4452d3608..614e25476 100644
--- a/docs/design/dlrover-overview.md
+++ b/docs/design/dlrover-overview.md
@@ -2,11 +2,11 @@
DLRover is an automatic distributed deep learning system.
DLRover can help users train their models with minimal efforts. For example,
-with DLRover, users need not provide any resource configuration for their
-deep learning training jobs. Instead, DLRover can pick up the appropriate resource
+with DLRover, users need not provide any resource configuration for their
+deep learning training jobs. Instead, DLRover can pick up the appropriate resource
configuration for each job smartly and continue to optimize those jobs during their runtime.
-DLRover's short-term goal is to support automatic resource configuration for DL training jobs.
+DLRover's short-term goal is to support automatic resource configuration for DL training jobs.
However, the long-term goal of DLRover is to make the whole deep learning model
training completely automatic.
@@ -20,14 +20,14 @@ workers. Using allreduce architecture, we need to take
account of the increasing communication cost with more workers.
It is difficult to configure the appropriate resource with different models.
-Model developers (users) have to learn more rather than model training
-algorithms when they are using those jobs to train their models. To
-run a training job, those users have to specify the required resources for their
-this job. Then the Kubernetes cluster can allocate the required resources and
+Model developers (users) have to learn more than just model-training
+algorithms when they use those jobs to train their models. To
+run a training job, users have to specify the required resources for
+the job. Then the Kubernetes cluster can allocate the required resources and
start the job. Unfortunately, we found it is quite an ineffective way
to ask the users to take care of the resource configuration.
-At first, users are usually the
-experts on model design but not training jobs and Kubernetes cluster. It is
+First, users are usually
+experts on model design but not on training jobs and the Kubernetes cluster. It is
not an easy task for them to have the optimal configuration in the first place.
Secondly, a training job's resources requirement may vary during its runtime.
A static resource configuration usually can not be the optimal one all the time.
@@ -39,7 +39,7 @@ users fail to provide the optimal resource configuration for their jobs.
We hope to design and implement a system which can free users from resource
configuration completely and focus on the model training itself. Without any
input (on resource configuration), DLRover can still provide the optimal
-resource plan for each training job. Meanwhile, DLRover can optimize the
+resource plan for each training job. Meanwhile, DLRover can optimize the
performance of training jobs further through resource adjustment when a job
is running.
@@ -48,13 +48,13 @@ supporting three different modes to satisfy users' requirements.
### Manual Mode
-Sometimes users want to explore a single job's performance through manually scaling this
+Sometimes users want to explore a single job's performance through manually scaling this
job's resources during runtime. DLRover allows users to apply new resource configuration
for a running job without restarting the job.
### Single-Job Mode
-During DL model development, users usually repeatedly train and test a model before
+During DL model development, users usually repeatedly train and test a model before
the model reaches a stable status. In this scenario, users only need to run a single job
without deploying extra components. However, single-job mode also supports resource auto-configuration
for the job. In this mode, auto-scale algorithms are located in the master of the job
@@ -66,11 +66,11 @@ not support the fault-tolerance of the master.
### Cluster Mode
-In the cluster mode, DLRover handles all training jobs in a cluster and
-executes with complete functions.
+In the cluster mode, DLRover handles all training jobs in a cluster and
+executes with complete functions.
-Unlike single-job mode, DLRover in cluster mode has a separate service called
-*Brain* which provides resource plans for each running job in the cluster.
+Unlike single-job mode, DLRover in cluster mode has a separate service called
+*Brain* which provides resource plans for each running job in the cluster.
The brain service persists all runtime statistics
of jobs into a database. The algorithm can utilize information of all
finished and running jobs to optimize the resources of new jobs. After
@@ -79,7 +79,6 @@ What's more, the master of a job only executes the resource plans from
brain service. When the master fails, DLRover can simply restart a new one
and the job can continue to run.
-
## Design
DLRover consists of four main components: ElasticJob, Elastic Trainer,
@@ -99,13 +98,14 @@ launch required Pods and each Pod will start an Elastic Agent on it.
During training, the training master of Elastic Trainer dispatches data shards
to workers. Meanwhile, the Cluster Monitor is monitoring
each job's running status (e.g., resource workload of each node) and
-cluster status (e.g., idle resources). Those data will be reported to Brain periodically and
-Brain persists the data into database. Then based on the job’s running status,
+cluster status (e.g., idle resources). Those data will be reported to Brain periodically and
+Brain persists the data into the database. Then, based on the job's running status,
DLRover Brain picks up appropriate algorithms to
generate new resource plans and informs Elastic Trainer to
start resources adjustment.
### ElasticJob to Support Elastic Scheduling
+
ElasticJob is a customized k8s controller to support elastic scheduling
of Pods for DL training jobs. ElasticJob is responsible to launch/delete
Pods on a k8s cluster according to a Scale CRD.
@@ -123,7 +123,7 @@ to launch/delete paramter servers and workers.
### Elastic Trainer to Support Auto-scaling of a Single Job
-For each training job, there is an Elastic Trainer to manage the job during
+For each training job, there is an Elastic Trainer to manage the job during
the job's whole life cycle. Elastic Trainer is to:
1. provide dynamic data sharding to support elasticity of a job.
@@ -151,7 +151,7 @@ those samples. All shards are placed into a TODO queue.
After a worker starts to run, the data input pipeline of a worker
will query one shard from Elastic Trainer and read samples by indices
in the shard. Meanwhile, Data Shard Manager marks this shard with the
-id of the worker and moves the shard from the TODO to the DOING queue.
+id of the worker and moves the shard from the TODO to the DOING queue.
After a worker consumes samples in the shard and updates parameters in PS,
it reports to the training master and queries a new shard.
Then Data Shard Manager deletes the finished shard from the DOING queue.
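+
+The TODO/DOING bookkeeping described above can be sketched as follows (an illustration of the idea only, not DLRover's actual implementation; a shard here is just a dict with an `id` field):
+
+```python
+from collections import deque
+
+class ShardQueues:
+    """Toy model of the TODO/DOING queues described above."""
+
+    def __init__(self, shards):
+        self.todo = deque(shards)   # shards waiting to be dispatched
+        self.doing = {}             # shard id -> (worker id, shard)
+
+    def dispatch(self, worker_id):
+        shard = self.todo.popleft()
+        # mark the shard with the worker id and move it to DOING
+        self.doing[shard["id"]] = (worker_id, shard)
+        return shard
+
+    def report_done(self, shard_id):
+        # finished shards are deleted from the DOING queue
+        self.doing.pop(shard_id)
+
+    def requeue(self, worker_id):
+        # a terminated worker's uncompleted shards go back to TODO
+        for sid, (wid, shard) in list(self.doing.items()):
+            if wid == worker_id:
+                self.todo.append(shard)
+                del self.doing[sid]
+```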
@@ -160,12 +160,12 @@ Then Data Shard Manager deletes the finished shard from the DOING queue.
-
#### Elasticity of PS Training
1. Worker elasticity. In asynchronous SGD, each PS updates parameters with gradients
from a worker independently and does not synchronize with other workers.
-Thus, Elastic Trainer can add or remove workers without influencing other workers. After a new worker starts, it connects to all PS and queries shards from Data Shard Manager
+Thus, Elastic Trainer can add or remove workers without influencing other workers.
+After a new worker starts, it connects to all PS and queries shards from Data Shard Manager
and consume shards to compute gradients. If a worker is terminated,
Data Shard Manager moves uncompleted shards of this worker back to the TODO queue from the DOING queue.
Later the shard can be dispatched to other workers.
@@ -178,7 +178,6 @@ parameter servers to the Elastic Agent of all Pods. Then the Elastic
Agent will notify the training framework (e.g. TensorFlow) to restart
training and restore model parameters from a checkpoint.
-
#### Elasticity of AllReduce Training
DLRover implements Fault-tolerance of allreduce
@@ -196,7 +195,7 @@ Meanwhile, the master watches the event of the failed worker by K8s APIs and
re-assigns new ranks for the alive workers. The oldest worker will get
the rank 0 and broadcast its model and optimization states
in the memory to other workers. Because the oldest worker certainly has the whole
-model at the time of worker fail. Then, the training continues.
+model at the time of the worker failure. Then the training continues.
2. Scalable. After a new worker starts, it will send a start signal to the master
and the master will re-assign ranks with all alive workers. The worker
@@ -206,13 +205,13 @@ a new world with the new rank.
3. Fixed batch size. Unlike asynchronous training, the batch size $B$ of
synchronous stochastic gradient descent (SGD) is $B = N * B_m$, where $N$ is the number
-of workers and 𝐵𝑚 is the size of mini-batch performed by each worker at each step.
-However, the batch size of synchronous SGD affects the model accuracy.
-So, the model accuracy may fluctuate if the number of workers changes at runtime.
+of workers and $B_m$ is the size of the mini-batch performed by each worker at each step.
+However, the batch size of synchronous SGD affects the model accuracy,
+so the model accuracy may fluctuate if the number of workers changes at runtime.
In order to overcome the challenge, DLRover supports fixed batch size at runtime
if the maximum number $N$ of workers is configured. Before the phase of al-reduce,
the master assigns the number of mini-batch computations to workers according to
-the number $N_0$ of existing workers. The worker 𝑖 will perform $𝑚_𝑖$ mini-batch
+the number $N_0$ of existing workers. Worker $i$ will perform $m_i$ mini-batches
before merging gradients across workers by all-reduce: $m_i = \lfloor N/N_0 \rfloor + 1$ if $i < N \% N_0$,
otherwise $m_i = \lfloor N/N_0 \rfloor$.
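+
+A quick numeric check of this assignment (an illustrative snippet; the function name is made up and this is not DLRover code):
+
+```python
+def num_minibatches(i, n, n0):
+    """Mini-batches worker i runs before each all-reduce, per the formula above
+    (n: configured maximum number of workers, n0: number of existing workers)."""
+    base, extra = divmod(n, n0)
+    return base + 1 if i < extra else base
+
+# e.g. N = 8 configured workers but only N0 = 3 alive: assignments are [3, 3, 2],
+# so 8 mini-batches are still computed per global step and the batch size is kept.
+print([num_minibatches(i, 8, 3) for i in range(3)])
+```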
@@ -224,7 +223,7 @@ otherwise, $𝑚_𝑖 =⌊𝑁/𝑁_0⌋$ .
Parameter servers and workers can fail at any time. Thus the trainer will checkpoint
the parameters periodically. When a parameter server fails, the trainer starts
-another parameter server and resume the checkpointing. For worker failure,
+another parameter server and resumes the checkpointing. For a worker failure,
the trainer just starts a new worker and lets it pick up a shard for computation.
### Brain Service to Support Auto-scaling Jobs in a Cluster
@@ -243,7 +242,7 @@ includes three components.
#### Administor
-When a training job is created, the corresponding administor is also created
+When a training job is created, the corresponding administor is also created
in the brain. This administor will administer the job during the job's whole
lifetime. When the job is initialized or a performance issue is observed in the job,
the administor will create an optimize event for a new resource plan.
@@ -259,12 +258,12 @@ Then we can have the optimal resource plans.
#### Algorithm Executor
-After the optimize processor decides the algorithm for the job, the algorithm
+After the optimize processor decides the algorithm for the job, the algorithm
executor executes the algorithm and generates the resource plan.
### Cluster Monitor
-In order to detach Brain from a particular platform, Brain only use data in the database
+In order to detach Brain from a particular platform, Brain only uses data in the database
to generate optimized resource plans for jobs. In this way, we can easily reuse similar algorithms
for different cluster platforms (e.g., Kubernetes and Ray). Therefore, the Cluster Monitor is
-implemented for particular platform to collect jobs and cluster statistic data.
\ No newline at end of file
+implemented for each particular platform to collect job and cluster statistics.
diff --git a/docs/design/scale-node-design.md b/docs/design/scale-node-design.md
index cc8cd100d..e8139ffa7 100644
--- a/docs/design/scale-node-design.md
+++ b/docs/design/scale-node-design.md
@@ -11,6 +11,7 @@ the training process.
## Design
Auto-scaling in DLRover contains the following steps:
+
- The `JobResourceOptimizer` in `JobManager` queries a `ResourcePlan`.
- The `TrainingNodeManager` (e.g. `PSManager` and `WorkerManger`)
in `JobManager` generates the `ScalePlan`.
@@ -40,7 +41,7 @@ and PS.
of the job and adjust resources to mitigate the bottleneck.
At each stage, `JobResourceOptimizer` queries a `ResourcePlan` by calling its
-`ResourceOptimizer`.
+`ResourceOptimizer`.
The `ResourcePlan` contains resource configurations of training nodes. For
example:
@@ -141,15 +142,15 @@ If the number of PS is smaller than the current number of PS. `PSManager` will
not delete the additional PS nodes immediately, because model parameters are
stored across PS nodes and would be lost if we deleted PS nodes before
workers checkpoint model parameters on PS.
-`PSManager` will add those PS nodes which is to be removed
-to a queuee `_pre_dropped_ps` and remove those PS hosts from
+`PSManager` will add the PS nodes which are to be removed
+to a queue `_pre_dropped_ps` and remove those PS hosts from
its `_next_training_ps_cluster`. After workers succeed in checkpointing model parameters
and connecting to the next PS cluster, `PSManager` will set those PS nodes into `remove_nodes`
of a `ScalePlan`.
**Migrate PS.**
-If there is a updatation in a PS node's resource in `ResourcePlan.node_resources`,
-`PSManager` will create a PS `Node` with the new resource.
+If there is an update to a PS node's resource in `ResourcePlan.node_resources`,
+`PSManager` will create a PS `Node` with the new resource.
After the new PS node is running, `PSManager`
will update its `_next_training_ps_cluster` and notify
workers to connect to new PS clusters. After workers succeed in connecting to the new PS
@@ -177,7 +178,8 @@ for additional workers to be removed.
create/update/delete nodes to achieve the `ScalePlan`. We can implement
different `Scaler` classes for different distributed clusters.
-#### Pod Scaler
+#### Pod Scaler
+
`PodScaler` is implemented by K8s Python APIs to create/update/delete Pods
on a K8s cluster.
@@ -192,6 +194,7 @@ will replace the `Node` into the queue to retry.
by the name of the node in `remove_nodes`.
#### ElasticJob Scaler
+
`ElasticJobScaler` is implemented to create a `ScalePlan` CRD to notify the
[ElasticJob controller](docs/design/elastic-training-operator.md) to
reconcile Pods by the `ScalePlan` on a K8s cluster. The example of `ScalePlan` is
diff --git a/docs/design/streaming-data-splitter-and-manager.md b/docs/design/streaming-data-splitter-and-manager.md
index 691a71595..209bf3699 100644
--- a/docs/design/streaming-data-splitter-and-manager.md
+++ b/docs/design/streaming-data-splitter-and-manager.md
@@ -1,23 +1,29 @@
# Streaming DataShardManger and Splitter
+
The design describes the architecture of the Streaming DataShardManger.
The Streaming DataShardManger is responsible for dispatching data and keeping data checkpoints.
## An Intro to Online learning
+
Online learning represents a family of machine learning methods, where a learner attempts
-to tackle some predictive task by learning from a sequence of data instances one by one at each time. In contrast, offline/batch learner
+to tackle some predictive task by learning from a sequence of data instances one by one
+over time. In contrast, an offline/batch learner
learns from static shuffled data samples and is not sensitive to the data sequence.
Online learning has become a promising technique for learning from continuous
-streams of data in many real-world applications.
+streams of data in many real-world applications.
Thus, the key point of online learning is that the data should be dispatched sequentially and consumed at least once.
## PartitionOffsets
-Stream processing is the processing of data in motion, or in other words, computing on data directly as it is produced or received.
-In addition, we would never know how many training samples are in advance and when they would arrive.
+
+Stream processing is the processing of data in motion, or in other words,
+computing on data directly as it is produced or received.
+In addition, we never know in advance how many training samples there are or when they will arrive.
As a result, the workers and PS keep running and wait for upstream samples.
PartitionOffsets is responsible for holding the consumption status of the streaming data.
+
```Python
class PartitionOffsets(object):
@@ -28,31 +34,26 @@ class PartitionOffsets(object):
self.partition_num = 0
self.update_partitions()
```
+
## Streaming Data Splitter
The streaming data splitter assumes that streaming samples are stored in different partitions and every sample is marked
with an offset which indicates the sample's sequence. The streaming data splitter is responsible for creating shards.
A shard contains an offset range [start, end) and the partition of the records.
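+
+As a rough illustration, a streaming shard could be modelled like this (a sketch only; the field names are assumptions, not the actual class):
+
+```Python
+from dataclasses import dataclass
+
+@dataclass
+class StreamingShard:
+    partition: str  # the partition the records come from
+    start: int      # first offset in the shard (inclusive)
+    end: int        # end offset of the shard (exclusive)
+```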
-
## Streaming DataShardManger Checkpoints
-- When doing checkpoints, Streaming DataShardManger saves not only current doing tasks and undo tasks but also the splitter info.
-- When restoring from checkpoints, Streaming DataShardManger loads not only current doing tasks and undo tasks but also the splitter info.
+- When doing checkpoints, Streaming DataShardManger saves not only current doing tasks and
+undo tasks but also the splitter info.
+- When restoring from checkpoints, Streaming DataShardManger loads not only current doing
+tasks and undo tasks but also the splitter info.
## Streaming Reader
As for getting training data, there are two modes of online learning:
- The training data is stored in a log store or Kafka topic, and the reader reads data from the log store or topic.
-- The training data is processed by a streaming job and sink of the job sends the data to a buffer. The reader reads data from the buffer. By this means, the worker is decoupled with data source.
-
-In conclusion, the worker is stateless in both online learning and offline learning.
-
-
-
-
-
-
-
+- The training data is processed by a streaming job and the sink of the job sends the data to a buffer. The reader
+reads data from the buffer. In this way, the worker is decoupled from the data source.
+
+In conclusion, the worker is stateless in both online learning and offline learning.
diff --git a/docs/design/training-master.md b/docs/design/training-master.md
index cf7ce273e..144d594eb 100644
--- a/docs/design/training-master.md
+++ b/docs/design/training-master.md
@@ -1,4 +1,5 @@
# Training Master of DLRover
+
The design describes the architecture of the training master of DLRover.
The master is responsible for controlling the training of a single job and
provides the following services:
@@ -25,6 +26,7 @@ distributed systems.
## Architecture of the Training Master
The master contains 5 components:
+
- Resource Generator: it generates resource configuration plans for the job.
- Scaler: it generates Scale CRDs according to resource plans.
- Stats Collector: it collects the runtime statistics for the job, including
@@ -70,6 +72,7 @@ class StatsCollector(metaclass=ABCMeta):
def report_resource_usage(self):
pass
```
+
We can implement `report_resource_usage` to report the runtime statistics
(e.g. CPU/memory usage) of all parameter servers and workers
to DLRover Brain to persist them in a database like MySQL.
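+
+For example, a concrete collector might look roughly like the sketch below (illustrative only; the constructor arguments and the `brain_client` object with its `report_node_usage` method are assumptions, not DLRover's real API):
+
+```Python
+class BrainStatsCollector(StatsCollector):
+    """Sketch of a collector that pushes node resource usage to Brain."""
+
+    def __init__(self, job_name, brain_client):
+        self._job_name = job_name
+        self._brain_client = brain_client
+        self._node_usage = {}  # e.g. {"worker-0": {"cpu": 2.5, "memory_mb": 8192}}
+
+    def collect_node_usage(self, node, usage):
+        self._node_usage[node] = usage
+
+    def report_resource_usage(self):
+        for node, usage in self._node_usage.items():
+            self._brain_client.report_node_usage(self._job_name, node, usage)
+```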
@@ -200,7 +203,7 @@ After a worker start to training, `DataShardManager` dispatch the shard
to the worker. After the worker uses up samples in the shard, it will
report a shard status to `DataShardManger`.
The shard only contains the indices of samples, not the sample
-data.
+data.
```Python
class DataShardManger(metaclass=ABCmeta):
@@ -234,4 +237,3 @@ class DataShardManger(metaclass=ABCmeta):
"""Restore uncompleted data shards from a checkpoint"""
pass
```
-
diff --git a/docs/design/virtual-env.md b/docs/design/virtual-env.md
index fd5f1311d..bd6f482f1 100644
--- a/docs/design/virtual-env.md
+++ b/docs/design/virtual-env.md
@@ -3,25 +3,25 @@
## Background
The cluster mode of DLRover is designed to train deep learning models automatically
-in production environment. The correctness and robustness of DLRover is
+in a production environment. The correctness and robustness of DLRover are
critical for obtaining the required DL models. Therefore, there is a high demand for complete and
reliable testing on DLRover before each update.
DLRover consists of multiple components and those components coordinate with each
other to train the DL models. Since DLRover can run quite different training jobs, e.g.,
-Tensorflow and PyTorch, for different DL models, the job relevant components
+Tensorflow and PyTorch, for different DL models, the job-relevant components
(e.g., operators) are quite different from each other. Unit tests indeed guarantee
the functional correctness of a single component. We still need to test the coordination
among different components.
-Meanwhile, new algorithms (e.g., auto-configure and optimization) are in frequent
-iteration. For a new algorithm, we need to guarantee the efficiency of this algorithm
+Meanwhile, new algorithms (e.g., auto-configure and optimization) are in frequent
+iteration. For a new algorithm, we need to guarantee the efficiency of this algorithm
as well as the correctness. However, currently we can only run several sample jobs
-on gray cluster and observe the algorithm's efficiency that usually shows inaccurate
+on a gray cluster and observe the algorithm's efficiency, which usually shows inaccurate
results compared to the production environment. Furthermore, a new algorithm requires
non-trivial time for a complete test. During each iteration, we need to compare
multiple algorithms and pick the best one. There is a high demand to test multiple
-algorithms simultaneously.
+algorithms simultaneously.
## Design Goal
@@ -39,7 +39,7 @@ Based on the background, we have listed the design goals of virtual environment:
-Similar to virtual machines, each **Virtual Environment** can run a different "DLRover" system.
+Similar to virtual machines, each **Virtual Environment** can run a different "DLRover" system.
Among those DLRover systems, at most one system can be labelled
as **Real** while all others can only be **Virtual**. Generally, only the real system
is to run and optimize DL training jobs and virtual systems are for testing.
@@ -47,37 +47,38 @@ However, if a case is observed, e.g., too many jobs fail, the real system will b
switched to virtual and a pre-chosen virtual system takes control and starts to
train DL models.
-DLRover consists of four major components: Brain, Trainer, Store and Job Operator
+DLRover consists of four major components: Brain, Trainer, Store and Job Operator
(i.e., training framework related, like Tensorflow, PyTorch etc.). Each component
has its own virtual environment. Note that two virtual DLRovers could share the same
-virtual components. For example, DLRover #1 and DLRover #2 could have different virtual
+virtual components. For example, DLRover #1 and DLRover #2 could have different virtual
Trainers but use the same virtual Brain.
### Virtual Brain
-The core part in the Brain is Optimizer and Optimizer is allowed to have multiple
-different implementations. Therefore, we can implement different Optimizers for
+The core part in the Brain is Optimizer and Optimizer is allowed to have multiple
+different implementations. Therefore, we can implement different Optimizers for
different virtual Brain.
-Evaluator is another key part in Brain virtual environment. Evaluator is to measure
-or compare the efficiency of the algorithms. Similar to Optimizer, Evaluator also
+Evaluator is another key part in Brain virtual environment. Evaluator is to measure
+or compare the efficiency of the algorithms. Similar to Optimizer, Evaluator also
allows different implementations.
### Virtual Store
-Store is used to keep necessary data for DLRover. Each virtual environment can have
+Store is used to keep necessary data for DLRover. Each virtual environment can have
separate data store, e.g., tables in MySQL.
### Virtual Operator
-Operator is used to modify/kill/launch jobs virtually or really. Note that, if a virtual
+Operator is used to modify/kill/launch jobs virtually or really. Note that, if a virtual
operator updates jobs virtually, it needs to obtain the corresponding cluster status for those
virtual job operations.
### Virtual Trainer
Each virtual Trainer has three major tasks:
-1. To query optimization plans from virtual Brain.
+
+1. To query optimization plans from virtual Brain.
2. To convert optimization plans to ScalePlan and send to virtual operator.
3. Based on ScalePlan, to simulate job's virtual status and persist relevant data to store.
@@ -85,15 +86,12 @@ The simulator interface is as following:
```go
type JobStatus struct {
- Speed float
- ...
+ Speed float
+ ...
}
type JobSimulator interface {
- UpdateJob(plan *OptimizePlan) error
- GetJobStatus() *JobStatus
+ UpdateJob(plan *OptimizePlan) error
+ GetJobStatus() *JobStatus
}
```
-
-
-
diff --git a/docs/developer_guide.md b/docs/developer_guide.md
index c0686114d..73101b7fb 100644
--- a/docs/developer_guide.md
+++ b/docs/developer_guide.md
mkdir -p $(go env GOPATH)/src/github.com/intelligent-machine-learning
ln -sf ${GIT_TRAINING} $(go env GOPATH)/src/github.com/intelligent-machine-learning/dlrover
```
-- GIT_TRAINING should be the location where you checked out https://github.com/intelligent-machine-learning/dlrover
+- GIT_TRAINING should be the location where you checked out