Closed
Progress:
Design docs:
- https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/dist/README.md:
  - Fault tolerant master program
  - Fault tolerant parameter server
  - Fault tolerant trainer
- Paddle cluster design #1696 (PR review in progress)
  - Parameter server checkpointing / recovery from a checkpoint file.
  - Uploading a custom dataset to the cluster / how the reader can use the uploaded dataset.
- Design doc: submit a distributed job #1770 (PR review in progress)
  - How to submit a cluster training job from the client.
  - PaddlePaddle Server.
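
To make the checkpointing item above concrete, here is a minimal sketch of saving and recovering parameter server state. It is not taken from the design docs. The JSON file layout, the SHA-256 checksum, and the names `save_checkpoint` and `load_checkpoint` are all assumptions made for illustration. The idea is to write to a temporary file, replace the old checkpoint atomically, and reject a corrupt file on recovery.

```python
import hashlib
import json
import os
import tempfile


def _digest(params):
    # Checksum over a canonical JSON encoding of the parameters.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save_checkpoint(params, path):
    """Atomically write params (dict of name -> list of floats) to path.

    Hypothetical layout: {"sha256": <hex digest>, "params": {...}}.
    """
    record = {"sha256": _digest(params), "params": params}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())
        # A crash before this line leaves the previous checkpoint untouched.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved params, or None if the file is missing or corrupt."""
    try:
        with open(path) as f:
            record = json.load(f)
    except (OSError, ValueError):
        return None
    if not isinstance(record, dict) or "params" not in record:
        return None
    if _digest(record["params"]) != record.get("sha256"):
        return None
    return record["params"]
```

A restarted parameter server would call `load_checkpoint` first and fall back to fresh initialization when it returns `None`. Where the checkpoint lives (local disk or a shared filesystem) and how its location is recorded are left open here.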
TODO:
For the first level bullet points, please refer to the corresponding design doc. The second level bullet points are questions that we need to figure out.
- Implement PaddlePaddle Server.
- Implement master program.
- Implement fault tolerant parameter server.
  - Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If the effort is greater than or equal to that of a rewrite in Go, maybe we can rewrite it in Go.
  - Do we need to support sparse parameter updates in v1?
  - What kind of update rule does the parameter server need to support in v1? Maybe only a simple "add" (nothing momentum based).
- Implement fault tolerant trainer.
  - Be able to scale up the trainers.
  - This involves changes to the Python part and the native part (C++ or Go). We need to define a clean C API for Python to use.
- Client submits a cluster training job.
- Set up the etcd service (no detailed specification in the design doc; etcd from k8s will not be reused, according to "distributed training: should we re-use etcd from k8s?" #1807).
  - How do we control namespace access in etcd?
- Filesystem provisioned for each user (no detailed design yet).
- Collect logs and display to users (no detailed design yet).
- Upload custom dataset to cluster.
  - Do we need to support merging data files into one big custom file to speed up sequential reads? Is this performance optimization needed in the first version?
  - How can the trainer read the dataset and stay backward / forward compatible? Maybe we need a reader "driver" for each dataset.
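
As a starting point for the two dataset questions above, here is a minimal sketch of merging small files into one big file that can be read in a single sequential pass, plus a tiny reader "driver" registry. Nothing here comes from a design doc. The length-prefixed record format, the format name `length-prefixed-v1`, and the names `merge_files`, `read_records`, `READER_DRIVERS`, and `open_dataset` are hypothetical.

```python
import struct

# Assumed record layout: 8-byte little-endian length, then the raw bytes.
_LEN = struct.Struct("<Q")


def merge_files(src_paths, dst_path):
    """Concatenate small files into dst_path as length-prefixed records."""
    with open(dst_path, "wb") as dst:
        for src in src_paths:
            with open(src, "rb") as f:
                data = f.read()
            dst.write(_LEN.pack(len(data)))
            dst.write(data)


def read_records(path):
    """Yield each record of a merged file in one sequential pass."""
    with open(path, "rb") as f:
        while True:
            header = f.read(_LEN.size)
            if not header:
                return  # clean end of file
            if len(header) < _LEN.size:
                raise ValueError("truncated length prefix")
            (length,) = _LEN.unpack(header)
            data = f.read(length)
            if len(data) < length:
                raise ValueError("truncated record")
            yield data


# Hypothetical "driver" registry: one reader function per on-disk format,
# so a new format version can be added without changing trainer code.
READER_DRIVERS = {"length-prefixed-v1": read_records}


def open_dataset(fmt, path):
    """Look up the reader driver for fmt and return its record iterator."""
    return READER_DRIVERS[fmt](path)
```

Keying the drivers by a versioned format name is one way to approach compatibility: older datasets keep resolving to their original reader while newer formats register their own. Whether the merge step is worth having in the first version is still the open question.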