Closed
Description
Phase 1: demoable, open to a subset of users (internal beta)
Target date: 2017-05-31
- Only needs to support Jupyter notebooks on PaddlePaddle Cloud when doing distributed training.
- The Jupyter notebook needs to receive callbacks containing the cost so it can plot on its own (same as the single-machine workflow, where plotting happens inside the callback).
- PaddlePaddle Server -- yanxu (if the first version only supports Jupyter notebooks on cloud, this may not be needed)
  - The end user's `train` function talks to the PaddlePaddle server, which invokes Docker to build images.
- Paddle Cloud web pages -- prototype first -- wuyi, yanxu, gongweibao
  - Write training code on the web page
  - Package the code and submit the training job in the cloud
  - Visualize the training process (animated cost curve)
  - Monitor job status
  - View training logs
  - View personal quota
  - Support inference applications (not considered for now)
- RBAC -- wuyi
  - Log in (register) with a Baidu account --> activate the account --> configure the namespace, etc.
- Storage: GlusterFS -- weibao
  - Permissions and quota
  - Uploading and sharding training data
  - Performance considerations; decide the deployment plan for the demo version. Using an internal storage system such as BFS would also incur adaptation cost with Kubernetes.
- Develop scale-out support for pserver and trainer
- Investigate Network Policy for network isolation
  - Not needed for the demo, but required eventually; without it the cluster is insecure.
- GPU resources -- Done
- v2 distributed training: emphasize scaling out and make the change visible.
  - Scale out the training job
  - How do we show the effect after scaling out?
- Training job scaling:
- Implement the master program. (helin)
  - Master, trainer, and pserver service discovery. (helin)
  - Master-trainer communication. (helin)
- Implement a fault-tolerant parameter server.
  - Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If that effort is greater than or equal to a rewrite in Go, we could rewrite it in Go.
  - Do we need to support sparse parameter updates in v1?
  - What update rules does the parameter server need to support in v1? Maybe only a simple "add" (nothing momentum-based).
- Implement a fault-tolerant trainer.
  - Able to scale up the number of trainers.
  - This involves changes to both the Python part and the native part (C++ or Go). We need to define a clean C API for Python to use.
- Set up the etcd service (no detailed specification in the design doc; we will not reuse etcd from k8s, per "distributed training: should we re-use etcd from k8s?" #1807).
  - How do we control etcd access namespaces?
- Upload custom datasets to the cluster.
  - Do we need to support merging data files into one big file to speed up sequential reads? Is this a performance concern for the first version?
  - How can the trainer read the dataset and stay backward/forward compatible? Maybe we need a reader "driver" for each dataset format.
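The "simple add" update rule from the parameter-server questions above can be sketched in a few lines. This is an illustrative toy under the synchronous-SGD assumption (the pserver waits for every trainer's gradient before updating); the names `ParameterServer` and `push_gradient` are made up for the sketch and are not PaddlePaddle's actual API.

```python
class ParameterServer:
    """Toy synchronous-SGD pserver with only the "add" rule (no momentum)."""

    def __init__(self, params, lr, num_trainers):
        self.params = list(params)   # flat parameter vector
        self.lr = lr
        self.num_trainers = num_trainers
        self._pending = []           # gradients collected for the current step

    def push_gradient(self, grad):
        """Collect one trainer's gradient; apply once all trainers reported."""
        self._pending.append(list(grad))
        if len(self._pending) == self.num_trainers:
            self._apply()

    def _apply(self):
        # w -= lr * mean(gradients) -- the whole "add" update rule.
        for i in range(len(self.params)):
            mean_g = sum(g[i] for g in self._pending) / self.num_trainers
            self.params[i] -= self.lr * mean_g
        self._pending = []

ps = ParameterServer([1.0, 2.0], lr=0.5, num_trainers=2)
ps.push_gradient([0.2, 0.4])
ps.push_gradient([0.6, 0.0])   # second gradient triggers the update
```

A fault-tolerant version would additionally checkpoint `params` and tolerate a trainer never reporting (e.g. via a timeout before applying), which is exactly the open design question above.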
Risks:
- In the long run, high-performance storage needs deep support.
- Web page development is a sizable workload, and few people are available for it.
Design docs:
- https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/cluster_train:
- Fault tolerant master program
- Fault tolerant parameter server
- Fault tolerant trainer
- Parameter server checkpointing / recovering from a checkpoint file.
- Upload custom dataset to cluster / how can the reader use the uploaded dataset.
- Design doc: submit a distributed job #1770 (PR review in progress)
- How to submit cluster training job from client.
- PaddlePaddle Server
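As a rough illustration of the checkpointing item in the design-doc list above: a minimal save/restore sketch using an atomic rename so a crash mid-write never leaves a truncated checkpoint. The function names and the JSON on-disk format are assumptions for the sketch, not the format from the design doc.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, params):
    """Dump to a temp file, then rename over the target (atomic on POSIX)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "params": params}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return (step, params); (0, None) means start fresh."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["params"]

ckpt = os.path.join(tempfile.mkdtemp(), "pserver.ckpt")
save_checkpoint(ckpt, step=100, params=[0.5, -1.25])
step, params = load_checkpoint(ckpt)
```

The real pserver would also need to agree with the master on which step the checkpoint corresponds to, so recovered parameters and the task queue stay consistent.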
4/24/2017 meeting minutes:
Scope for the first version:
pserver:
⁃ TCP only; no RDMA support
⁃ No sparse updates
⁃ Support dynamic scaling of trainers
⁃ Synchronous SGD
trainer:
⁃ pserver client
⁃ Fetch task IDs and process data by task
⁃ Dynamic scaling; the demo emphasizes scaling out and makes the change visible.
master:
⁃ Service discovery
⁃ Task dispatching
paddle server:
⁃ Build Docker images on Kubernetes
⁃ Launch Paddle jobs
paddle client:
⁃ Submit cluster jobs (Python code; add an optional argument to paddle.train that contains the distributed-training configuration.)
⁃ Command line: paddle upload/download
- Users are not allowed to plot inside distributed training; they can only print logs.
- Paddle will provide an animated chart of the cost.
- Whether the parameter server needs to be rewritten requires more investigation.
- Consensus was reached on the PR questions.
- Work assignments (see the issue comments, the 4/24/2017 plan)
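To make the paddle.train note in the minutes concrete, here is one hypothetical shape such an optional distributed-training argument could take. Every field and function name here is illustrative only; the actual argument was still to be designed at the time of the meeting.

```python
def make_dist_config(num_trainers, num_pservers, etcd_endpoints):
    """Bundle the cluster settings a client would submit with the job.

    Hypothetical helper; defaults reflect the v1 scope from the minutes:
    synchronous SGD only, TCP only (no RDMA).
    """
    assert num_trainers >= 1 and num_pservers >= 1
    return {
        "num_trainers": num_trainers,
        "num_pservers": num_pservers,
        "etcd_endpoints": list(etcd_endpoints),
        "sync_sgd": True,   # v1 scope: synchronous SGD only
        "rdma": False,      # v1 scope: TCP only
    }

# The client would then pass this dict as the optional argument,
# e.g. paddle.train(..., dist_config=cfg) -- signature is an assumption.
cfg = make_dist_config(4, 2, ["http://etcd:2379"])
```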