
Fluid distributed training TODO #10279

Closed

@Yancey0623

Description

Fluid Distributed Training Features

EDL

  • implement the master process to schedule tasks
  • implement an etcd operator
  • implement a CRD to support Kubernetes v1.8
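To make the master's role concrete, here is a minimal sketch of the task scheduling it would do for elastic deep learning: hand out data-shard tasks to trainers and re-queue the tasks of a trainer that disappears. This uses an in-memory queue in place of etcd-backed state, and names like `Task`, `get_task`, and `on_trainer_lost` are illustrative, not the actual EDL API.

```python
import queue
from dataclasses import dataclass


@dataclass
class Task:
    # One shard of training data, to be consumed by a single trainer.
    task_id: int
    file_path: str


class Master:
    """Illustrative master process: dispatches data-shard tasks and
    re-schedules tasks held by lost trainers, which is the state an
    etcd-backed implementation would keep durably."""

    def __init__(self, shards):
        self.todo = queue.Queue()
        self.in_flight = {}  # task_id -> (trainer_id, Task)
        for i, path in enumerate(shards):
            self.todo.put(Task(i, path))

    def get_task(self, trainer_id):
        task = self.todo.get_nowait()
        self.in_flight[task.task_id] = (trainer_id, task)
        return task

    def finish_task(self, task_id):
        self.in_flight.pop(task_id)

    def on_trainer_lost(self, trainer_id):
        # Re-queue tasks owned by a dead trainer so scaling down
        # (or a crash) does not lose data shards.
        for tid, (owner, task) in list(self.in_flight.items()):
            if owner == trainer_id:
                self.in_flight.pop(tid)
                self.todo.put(task)


m = Master(["part-0", "part-1"])
first = m.get_task(trainer_id=7)
m.on_trainer_lost(7)                 # its task goes back to the queue
recovered = m.get_task(trainer_id=8)  # trainer 8 picks up remaining work
```

In the real design, `in_flight` and `todo` would live in etcd so a restarted master can recover them, which is what the etcd operator and CRD items above provide on Kubernetes.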

Support different communication library

  • gRPC performance enhancement
  • OpenMPI with RDMA and GPU direct
  • NCCL2 with multiple nodes
  • follow up on bRPC
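One common gRPC performance technique relevant to the first item is splitting a large serialized tensor into fixed-size chunks so each RPC message stays small and chunks can be streamed and pipelined. A plain-Python sketch of the framing logic (the 4 MiB chunk size and helper names are illustrative, not Fluid's actual wire format):

```python
def chunk_tensor(buf: bytes, chunk_size: int = 4 * 1024 * 1024):
    """Split a serialized tensor into fixed-size chunks so each gRPC
    message stays under the message-size limit and transfers overlap."""
    for offset in range(0, len(buf), chunk_size):
        yield buf[offset:offset + chunk_size]


def reassemble(chunks):
    # Receiver side: concatenate chunks back into the original buffer.
    return b"".join(chunks)


payload = bytes(10 * 1024 * 1024)        # a 10 MiB "tensor"
chunks = list(chunk_tensor(payload))     # 4 MiB + 4 MiB + 2 MiB
assert reassemble(chunks) == payload
```

With gRPC streaming, each chunk would be one message on a client-streaming RPC; serialization of one chunk can then overlap with transmission of the previous one.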

Experiment

  • measure how different distributed training strategies (sync, async, etc.) affect model accuracy and throughput
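The accuracy/throughput trade-off in that experiment comes from how updates are applied. A toy scalar-weight sketch of the two update rules (illustrative only, not Fluid's implementation): synchronous SGD averages all trainers' gradients into one update per step, while asynchronous SGD applies each gradient as it arrives, possibly computed against a stale parameter value.

```python
def sync_step(w, grads, lr=0.1):
    # Synchronous SGD: barrier, average gradients, single update.
    # Equivalent to one large-batch step; accuracy matches single-node.
    return w - lr * sum(grads) / len(grads)


def async_steps(w, grads, lr=0.1):
    # Asynchronous SGD: no barrier; each gradient is applied on arrival,
    # so later updates may use gradients computed from stale weights.
    for g in grads:
        w = w - lr * g
    return w


grads = [0.5, 1.5, 1.0]          # gradients from three trainers
w_sync = sync_step(1.0, grads)   # one averaged update
w_async = async_steps(1.0, grads)  # three independent updates
```

Async removes the barrier (higher throughput, no straggler stalls) but the staleness changes the effective update, which is exactly what the experiment would quantify against model accuracy.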

CE

  • automatically execute benchmark jobs on AWS and generate a report
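The report-generation half of that item could be as simple as aggregating per-run throughput samples into one summary line per job. A hedged sketch (job names, metric, and CSV layout are made up for illustration):

```python
import statistics


def make_report(runs):
    """Aggregate throughput samples per benchmark job into a CSV
    summary, e.g. for a nightly continuous-evaluation report."""
    lines = ["job,mean_imgs_per_sec,stdev"]
    for job, samples in sorted(runs.items()):
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
        lines.append(f"{job},{mean:.1f},{stdev:.1f}")
    return "\n".join(lines)


# Hypothetical results from two benchmark jobs, two runs each.
runs = {"resnet50_2x8": [210.0, 214.0], "vgg16_2x8": [98.0, 102.0]}
report = make_report(runs)
print(report)
```

Reporting a variance measure alongside the mean matters here: it lets the CE job flag noisy runs instead of mistaking cloud jitter for a regression.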

Future

  • characterize the differences between multi-machine-single-device and multi-machine-multi-device training
  • better integration with single-machine training
  • design more flexible, user-customized device placement for multi-machine training
  • discuss whether we need a remote executor
