
Distributed training progress and todo #1820

Closed
@helinwang

Description

Progress:
Design docs:

TODO:
For the first-level bullet points, please refer to the corresponding design docs. The second-level bullet points are questions we still need to figure out.

  • Implement PaddlePaddle Server.
  • Implement master program.
  • Implement fault tolerant parameter server.
    • Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If that effort is greater than or equal to rewriting in Go, maybe we should rewrite it in Go.
    • Do we need to support sparse parameter updates in v1?
    • What kind of update rules does the parameter server need to support in v1? Maybe only a simple "add" (no momentum-based rules); see the first sketch after this list.
  • Implement fault tolerant trainer.
    • Able to scale trainers up.
    • This involves changes to both the Python part and the native part (C++ or Go). We need to define a clean C API for Python to use; see the cgo sketch after this list.
  • Client submits cluster training jobs.
  • Set up the etcd service (no detailed specification in the design doc; we will not reuse etcd from k8s, per distributed training: should we re-use etcd from k8s? #1807).
    • How do we control the etcd access namespace? See the namespacing sketch after this list.
  • Filesystem provisioned for each user (no detailed design yet).
  • Collect logs and display to users (no detailed design yet).
  • Upload custom dataset to cluster.
    • Do we need to support merging data files into one big file to speed up sequential reads? Is this a performance concern for the first version?
    • How can the trainer read the dataset while staying backward / forward compatible? Maybe we need a reader "driver" for each dataset; see the interface sketch after this list.
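
A minimal sketch of the "add"-only update rule mentioned above, assuming the parameter server is rewritten in Go. All names here (`Gradient`, `Server`, `Apply`) are hypothetical, not an existing PaddlePaddle API; the point is that a stateless `p += delta` rule keeps fault tolerance simple, since there is no optimizer state to checkpoint besides the parameters themselves.

```go
package pserver

import (
	"fmt"
	"sync"
)

// Gradient carries a delta for one named parameter.
type Gradient struct {
	Name  string
	Delta []float32
}

// Server holds parameter shards and applies the plain "add" rule.
type Server struct {
	mu     sync.Mutex
	params map[string][]float32
}

func NewServer() *Server {
	return &Server{params: make(map[string][]float32)}
}

// Apply implements the simple "add" update: p += delta.
// A momentum-based rule would need per-parameter optimizer state,
// which is exactly what makes fault tolerance harder to get right.
func (s *Server) Apply(g Gradient) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.params[g.Name]
	if !ok {
		return fmt.Errorf("unknown parameter %q", g.Name)
	}
	if len(p) != len(g.Delta) {
		return fmt.Errorf("size mismatch for parameter %q", g.Name)
	}
	for i, d := range g.Delta {
		p[i] += d
	}
	return nil
}
```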
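For the trainer's C API, one option (if the native part ends up in Go) is to build a cgo-based C shared library that Python loads via ctypes. A hedged sketch under that assumption; the function names and the handle-registry design are illustrations, not the actual design:

```go
// Build with: go build -buildmode=c-shared -o libtrainer.so
package main

import "C"

import "sync"

// trainerClient is a placeholder for the real client state.
type trainerClient struct{ endpoint string }

var (
	mu      sync.Mutex
	nextID  C.int
	clients = map[C.int]*trainerClient{}
)

// trainer_client_new creates a client and returns an opaque integer
// handle, so Python never touches Go pointers directly.
//
//export trainer_client_new
func trainer_client_new(endpoint *C.char) C.int {
	mu.Lock()
	defer mu.Unlock()
	nextID++
	clients[nextID] = &trainerClient{endpoint: C.GoString(endpoint)}
	return nextID
}

// trainer_client_release frees the client behind a handle.
//
//export trainer_client_release
func trainer_client_release(id C.int) {
	mu.Lock()
	defer mu.Unlock()
	delete(clients, id)
}

func main() {} // required by -buildmode=c-shared
```

On the Python side this would be a `ctypes.CDLL("libtrainer.so")` plus calls to the exported functions, which keeps the Python/native boundary to a handful of C symbols.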
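On the etcd namespace question, one possibility is the etcd clientv3 `namespace` wrapper, which transparently prefixes every key a job touches. This sketch assumes one prefix per job ID; real isolation would additionally need etcd's auth with per-job users and role-based key-prefix permissions:

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/namespace"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Every key this job touches is transparently prefixed, so two
	// jobs cannot collide even on a shared etcd cluster.
	jobID := "job-0001" // hypothetical job ID layout
	cli.KV = namespace.NewKV(cli.KV, jobID+"/")
	cli.Watcher = namespace.NewWatcher(cli.Watcher, jobID+"/")
	cli.Lease = namespace.NewLease(cli.Lease, jobID+"/")

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// Stored as "job-0001/trainer/0" on the etcd server.
	if _, err := cli.Put(ctx, "trainer/0", "alive"); err != nil {
		panic(err)
	}
}
```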
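And a rough shape for the reader "driver" idea: a small per-format interface plus a registry (in the style of `database/sql` drivers), so the trainer can pick a driver from the dataset's metadata and old and new formats can coexist. The interface and names are entirely hypothetical:

```go
package reader

import "io"

// Record is one training sample in serialized form.
type Record []byte

// Driver knows how to iterate one dataset format. A versioned format
// string (e.g. "recordio/v1") lets old and new trainers negotiate
// backward / forward compatibility.
type Driver interface {
	Format() string
	Open(r io.Reader) (Iterator, error)
}

// Iterator yields records until io.EOF.
type Iterator interface {
	Next() (Record, error) // returns io.EOF when exhausted
	Close() error
}

// drivers is a registry keyed by format name, so a trainer can pick
// the right driver from the dataset's metadata at runtime.
var drivers = map[string]Driver{}

// Register installs a driver for its format.
func Register(d Driver) { drivers[d.Format()] = d }

// Lookup returns the driver for a format, if one is registered.
func Lookup(format string) (Driver, bool) {
	d, ok := drivers[format]
	return d, ok
}
```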
