
Distributed training progress and todo #1820

Closed
@helinwang

Description

Progress:
Design docs:

TODO:
For the first-level bullet points, please refer to the corresponding design docs. The second-level bullet points are questions we still need to figure out.

  • Implement PaddlePaddle Server.
  • Implement master program.
  • Implement fault tolerant parameter server.
    • Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If that effort is greater than or equal to rewriting in Go, maybe we should rewrite it in Go.
    • Do we need to support sparse parameter updates in v1?
    • What kind of update rules does the parameter server need to support in v1? Maybe only a simple "add" (no momentum-based rules); see the first sketch after this list.
  • Implement fault tolerant trainer.
    • Able to scale trainers up.
    • This involves changes to both the Python part and the native part (C++ or Go). We need to define a clean C API for Python to use; see the cgo sketch after this list.
  • Client submits cluster training jobs.
  • Set up the etcd service (no detailed specification in the design doc; we will not reuse etcd from k8s, per distributed training: should we re-use etcd from k8s? #1807).
    • How do we control the etcd access namespace? See the namespacing sketch after this list.
  • Filesystem provisioned for each user (no detailed design yet).
  • Collect logs and display to users (no detailed design yet).
  • Upload custom dataset to cluster.
    • Do we need to support merging data files into one big file to speed up sequential reads? Is this a performance concern for the first version?
    • How can the trainer read the dataset while staying backward / forward compatible? Maybe we need a reader "driver" for each dataset; see the interface sketch after this list.
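
A minimal sketch of the "add"-only update rule mentioned above, assuming the parameter server is rewritten in Go. All names here (`Gradient`, `Server`, `Apply`) are hypothetical, not an existing PaddlePaddle API; the point is that a stateless `p += delta` rule keeps fault tolerance simple, since there is no optimizer state to checkpoint besides the parameters themselves.

```go
package pserver

import (
	"fmt"
	"sync"
)

// Gradient carries a delta for one named parameter.
type Gradient struct {
	Name  string
	Delta []float32
}

// Server holds parameter shards and applies the plain "add" rule.
type Server struct {
	mu     sync.Mutex
	params map[string][]float32
}

func NewServer() *Server {
	return &Server{params: make(map[string][]float32)}
}

// Apply implements the simple "add" update: p += delta.
// A momentum-based rule would need per-parameter optimizer state,
// which is exactly what makes fault tolerance harder to get right.
func (s *Server) Apply(g Gradient) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.params[g.Name]
	if !ok {
		return fmt.Errorf("unknown parameter %q", g.Name)
	}
	if len(p) != len(g.Delta) {
		return fmt.Errorf("size mismatch for parameter %q", g.Name)
	}
	for i, d := range g.Delta {
		p[i] += d
	}
	return nil
}
```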
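For the trainer's C API, one option (if the native part ends up in Go) is to build a cgo-based C shared library that Python loads via ctypes. A hedged sketch under that assumption; the function names and the handle-registry design are illustrations, not the actual design:

```go
// Build with: go build -buildmode=c-shared -o libtrainer.so
package main

import "C"

import "sync"

// trainerClient is a placeholder for the real client state.
type trainerClient struct{ endpoint string }

var (
	mu      sync.Mutex
	nextID  C.int
	clients = map[C.int]*trainerClient{}
)

// trainer_client_new creates a client and returns an opaque integer
// handle, so Python never touches Go pointers directly.
//
//export trainer_client_new
func trainer_client_new(endpoint *C.char) C.int {
	mu.Lock()
	defer mu.Unlock()
	nextID++
	clients[nextID] = &trainerClient{endpoint: C.GoString(endpoint)}
	return nextID
}

// trainer_client_release frees the client behind a handle.
//
//export trainer_client_release
func trainer_client_release(id C.int) {
	mu.Lock()
	defer mu.Unlock()
	delete(clients, id)
}

func main() {} // required by -buildmode=c-shared
```

On the Python side this would be a `ctypes.CDLL("libtrainer.so")` plus calls to the exported functions, which keeps the Python/native boundary to a handful of C symbols.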
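On the etcd namespace question, one possibility is the etcd clientv3 `namespace` wrapper, which transparently prefixes every key a job touches. This sketch assumes one prefix per job ID; real isolation would additionally need etcd's auth with per-job users and role-based key-prefix permissions:

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/namespace"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Every key this job touches is transparently prefixed, so two
	// jobs cannot collide even on a shared etcd cluster.
	jobID := "job-0001" // hypothetical job ID layout
	cli.KV = namespace.NewKV(cli.KV, jobID+"/")
	cli.Watcher = namespace.NewWatcher(cli.Watcher, jobID+"/")
	cli.Lease = namespace.NewLease(cli.Lease, jobID+"/")

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// Stored as "job-0001/trainer/0" on the etcd server.
	if _, err := cli.Put(ctx, "trainer/0", "alive"); err != nil {
		panic(err)
	}
}
```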
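And a rough shape for the reader "driver" idea: a small per-format interface plus a registry (in the style of `database/sql` drivers), so the trainer can pick a driver from the dataset's metadata and old and new formats can coexist. The interface and names are entirely hypothetical:

```go
package reader

import "io"

// Record is one training sample in serialized form.
type Record []byte

// Driver knows how to iterate one dataset format. A versioned format
// string (e.g. "recordio/v1") lets old and new trainers negotiate
// backward / forward compatibility.
type Driver interface {
	Format() string
	Open(r io.Reader) (Iterator, error)
}

// Iterator yields records until io.EOF.
type Iterator interface {
	Next() (Record, error) // returns io.EOF when exhausted
	Close() error
}

// drivers is a registry keyed by format name, so a trainer can pick
// the right driver from the dataset's metadata at runtime.
var drivers = map[string]Driver{}

// Register installs a driver for its format.
func Register(d Driver) { drivers[d.Format()] = d }

// Lookup returns the driver for a format, if one is registered.
func Lookup(format string) (Driver, bool) {
	d, ok := drivers[format]
	return d, ok
}
```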
