Closed
Description
Phase 1: demoable, open to a subset of users (internal beta)
Target date: 2017-05-31
- Only needs to support Jupyter notebooks on PaddlePaddle Cloud when doing distributed training.
- The Jupyter notebook needs to receive callbacks containing the cost so it can plot on its own (same as the single-machine workflow, where plotting happens inside the callback).
- PaddlePaddle Server -- yanxu (if the first version only supports Jupyter notebooks on cloud, this may not be needed)
  - The end user's `train` function talks to the PaddlePaddle server, which invokes Docker to build images.
- Paddle Cloud web pages -- prototype first -- wuyi, yanxu, gongweibao
  - Write training code on the web page
  - Package the code and submit the training job in the cloud
  - Visualize the training process (animated cost curve)
  - Monitor job status
  - View training logs
  - View personal quota
  - Support inference applications (not considered for now)
- RBAC -- wuyi
  - Log in (register) with a Baidu account --> activate the account --> configure the namespace, etc.
- Storage: GlusterFS -- weibao
  - Permissions and quota
  - Uploading and sharding training data
  - Performance considerations; decide the deployment plan for the demo version. Using an internal storage system such as BFS would also incur adaptation cost with Kubernetes.
- Develop scale-out support for pserver and trainer
- Investigate Network Policy for network isolation
  - Not needed for the demo, but required eventually; without it the cluster is insecure.
- GPU resources -- Done
- v2 distributed training: emphasize scaling out and make the change visible.
  - Scale out the training job
  - How do we show the effect after scaling out?
- Training job scaling:
- Implement the master program. (helin)
  - Master, trainer, and pserver service discovery. (helin)
  - Master-trainer communication. (helin)
- Implement a fault-tolerant parameter server.
  - Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If that effort is greater than or equal to a rewrite in Go, we could rewrite it in Go.
  - Do we need to support sparse parameter updates in v1?
  - What update rules does the parameter server need to support in v1? Maybe only a simple "add" (nothing momentum-based).
- Implement a fault-tolerant trainer.
  - Able to scale up the number of trainers.
  - This involves changes to both the Python part and the native part (C++ or Go). We need to define a clean C API for Python to use.
- Set up the etcd service (no detailed specification in the design doc; we will not reuse etcd from k8s, per "distributed training: should we re-use etcd from k8s?" #1807).
  - How do we control etcd access namespaces?
- Upload custom datasets to the cluster.
  - Do we need to support merging data files into one big file to speed up sequential reads? Is this a performance concern for the first version?
  - How can the trainer read the dataset and stay backward/forward compatible? Maybe we need a reader "driver" for each dataset format.
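The "simple add" update rule from the parameter-server questions above can be sketched in a few lines. This is an illustrative toy under the synchronous-SGD assumption (the pserver waits for every trainer's gradient before updating); the names `ParameterServer` and `push_gradient` are made up for the sketch and are not PaddlePaddle's actual API.

```python
class ParameterServer:
    """Toy synchronous-SGD pserver with only the "add" rule (no momentum)."""

    def __init__(self, params, lr, num_trainers):
        self.params = list(params)   # flat parameter vector
        self.lr = lr
        self.num_trainers = num_trainers
        self._pending = []           # gradients collected for the current step

    def push_gradient(self, grad):
        """Collect one trainer's gradient; apply once all trainers reported."""
        self._pending.append(list(grad))
        if len(self._pending) == self.num_trainers:
            self._apply()

    def _apply(self):
        # w -= lr * mean(gradients) -- the whole "add" update rule.
        for i in range(len(self.params)):
            mean_g = sum(g[i] for g in self._pending) / self.num_trainers
            self.params[i] -= self.lr * mean_g
        self._pending = []

ps = ParameterServer([1.0, 2.0], lr=0.5, num_trainers=2)
ps.push_gradient([0.2, 0.4])
ps.push_gradient([0.6, 0.0])   # second gradient triggers the update
```

A fault-tolerant version would additionally checkpoint `params` and tolerate a trainer never reporting (e.g. via a timeout before applying), which is exactly the open design question above.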
Risks:
- In the long run, high-performance storage needs deep support.
- Web page development is a sizable workload, and few people are available for it.
Design docs:
- https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/cluster_train:
- Fault tolerant master program
- Fault tolerant parameter server
- Fault tolerant trainer
- Parameter server checkpointing / recovering from a checkpoint file.
- Upload custom dataset to cluster / how can the reader use the uploaded dataset.
- Design doc: submit a distributed job #1770 (PR review in progress)
- How to submit cluster training job from client.
- PaddlePaddle Server
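As a rough illustration of the checkpointing item in the design-doc list above: a minimal save/restore sketch using an atomic rename so a crash mid-write never leaves a truncated checkpoint. The function names and the JSON on-disk format are assumptions for the sketch, not the format from the design doc.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, params):
    """Dump to a temp file, then rename over the target (atomic on POSIX)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "params": params}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return (step, params); (0, None) means start fresh."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["params"]

ckpt = os.path.join(tempfile.mkdtemp(), "pserver.ckpt")
save_checkpoint(ckpt, step=100, params=[0.5, -1.25])
step, params = load_checkpoint(ckpt)
```

The real pserver would also need to agree with the master on which step the checkpoint corresponds to, so recovered parameters and the task queue stay consistent.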
4/24/2017 meeting minutes:
Scope for the first version:
pserver:
⁃ TCP only; no RDMA support
⁃ No sparse updates
⁃ Support dynamic scaling of trainers
⁃ Synchronous SGD
trainer:
⁃ pserver client
⁃ Fetch task IDs and process data by task
⁃ Dynamic scaling; the demo emphasizes scaling out and makes the change visible.
master:
⁃ Service discovery
⁃ Task dispatching
paddle server:
⁃ Build Docker images on Kubernetes
⁃ Launch Paddle jobs
paddle client:
⁃ Submit cluster jobs (Python code; add an optional argument to paddle.train that contains the distributed-training configuration.)
⁃ Command line: paddle upload/download
- Users are not allowed to plot inside distributed training; they can only print logs.
- Paddle will provide an animated chart of the cost.
- Whether the parameter server needs to be rewritten requires more investigation.
- Consensus was reached on the PR questions.
- Work assignments (see the issue comments, the 4/24/2017 plan)
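To make the paddle.train note in the minutes concrete, here is one hypothetical shape such an optional distributed-training argument could take. Every field and function name here is illustrative only; the actual argument was still to be designed at the time of the meeting.

```python
def make_dist_config(num_trainers, num_pservers, etcd_endpoints):
    """Bundle the cluster settings a client would submit with the job.

    Hypothetical helper; defaults reflect the v1 scope from the minutes:
    synchronous SGD only, TCP only (no RDMA).
    """
    assert num_trainers >= 1 and num_pservers >= 1
    return {
        "num_trainers": num_trainers,
        "num_pservers": num_pservers,
        "etcd_endpoints": list(etcd_endpoints),
        "sync_sgd": True,   # v1 scope: synchronous SGD only
        "rdma": False,      # v1 scope: TCP only
    }

# The client would then pass this dict as the optional argument,
# e.g. paddle.train(..., dist_config=cfg) -- signature is an assumption.
cfg = make_dist_config(4, 2, ["http://etcd:2379"])
```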