-
Notifications
You must be signed in to change notification settings - Fork 5.7k
update cluster_train page #8765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update cluster_train page #8765
Conversation
doc/howto/cluster/index_cn.rst
Outdated
@@ -1,7 +1,9 @@ | |||
分布式训练 | |||
========== | |||
|
|||
本节将介绍如何使用PaddlePaddle在不同的集群框架下完成分布式训练。分布式训练架构如下图所示: | |||
深度学习模型的效果好坏与数据量的大小往往有直接的关系,相同的模型,在增大训练数据集后一般都能取得更好的效果。但是当数据量增大到一定程度后,单台计算机已经难以承受,这时,使用对台计算机进行分布式训练就是一个很自然的解决方案。在分布式训练中,训练数据被分割为多份,参与训练的多台机器分别读取自己的数据进行训练,并协同对整体模型的参数进行更新。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 修改标点。
深度学习模型的效果好坏与数据量的大小往往有直接的关系:相同的模型,在增大训练数据集后一般都能取得更好的效果。但是当数据量增大到一定程度后,单台计算机已经难以承受。这时, - 使用对台-》使用多台
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/howto/cluster/index_cn.rst
Outdated
@@ -10,13 +12,25 @@ | |||
- 计算节点(Trainer): 每个trainer启动后读取切分好的一部分数据,开始神经网络的“前馈”和“后馈”计算,并和参数服务器通信。在完成一定量数据的训练后,上传计算得出的梯度(gradients),然后下载优化更新后的神经网络参数(parameters)。 | |||
- 参数服务器(Parameter server):每个参数服务器只保存整个神经网络所有参数的一部分。参数服务器接收从计算节点上传的梯度,并完成参数优化更新,再将更新后的参数下发到每个计算节点。 | |||
|
|||
这样,通过计算节点和参数服务器的分布式协作,可以完成神经网络的SGD方法的训练。PaddlePaddle可以同时支持同步随机梯度下降(SGD)和异步随机梯度下降。 | |||
通过计算节点和参数服务器的分布式协作,可以完成神经网络的SGD方法的训练。PaddlePaddle可以同时支持同步随机梯度下降(SGD)和异步随机梯度下降。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同步随机梯度下降需要在第一次出现SGD的时候进行解释。
可以完成神经网络的同步随机梯度下降(SGD)方法的训练。PaddlePaddle可以同时支持同步随机梯度下降(SGD)和异步随机梯度下降(ASGD)。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/howto/cluster/index_cn.rst
Outdated
|
||
在使用同步SGD训练神经网络时,PaddlePaddle使用同步屏障(barrier),使梯度的提交和参数的更新按照顺序方式执行。在异步SGD中,则并不会等待所有trainer提交梯度才更新参数,这样极大地提高了计算的并行性:参数服务器之间不相互依赖,并行地接收梯度和更新参数,参数服务器也不会等待计算节点全部都提交梯度之后才开始下一步,计算节点之间也不会相互依赖,并行地执行模型的训练。可以看出,虽然异步SGD方式会提高参数更新并行度, 但是并不能保证参数同步更新,在任意时间某一台参数服务器上保存的参数可能比另一台要更新,与同步SGD相比,梯度会有噪声。 | ||
在开始集群训练之前,需要先进行机器配置、集群PaddlePaddle安装等准备工作,了解如何通过这些步骤来配置分布式训练所需的基本环境: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
机器配置-》集群配置
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/howto/cluster/index_cn.rst
Outdated
cmd_argument_cn.md | ||
|
||
PaddlePaddle可以兼容各种不同的集群。每种集群各有优势,使用的具体方式也略有区别: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
略有区别-》有区别(因为区别还是挺大的,“略”字去掉)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
… dev_update_cluster_doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks very much!
solves #8215