
Fluid distributed training TODO #10279

Closed

@Yancey0623

Description

Fluid Distributed Training Features

EDL

  • implement the master process to schedule tasks
  • implement an etcd operator
  • implement a CRD to support Kubernetes v1.8
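To make the master's role concrete, here is a minimal sketch of the task scheduling it would do for elastic deep learning: hand out data-shard tasks to trainers and re-queue the tasks of a trainer that disappears. This uses an in-memory queue in place of etcd-backed state, and names like `Task`, `get_task`, and `on_trainer_lost` are illustrative, not the actual EDL API.

```python
import queue
from dataclasses import dataclass


@dataclass
class Task:
    # One shard of training data, to be consumed by a single trainer.
    task_id: int
    file_path: str


class Master:
    """Illustrative master process: dispatches data-shard tasks and
    re-schedules tasks held by lost trainers, which is the state an
    etcd-backed implementation would keep durably."""

    def __init__(self, shards):
        self.todo = queue.Queue()
        self.in_flight = {}  # task_id -> (trainer_id, Task)
        for i, path in enumerate(shards):
            self.todo.put(Task(i, path))

    def get_task(self, trainer_id):
        task = self.todo.get_nowait()
        self.in_flight[task.task_id] = (trainer_id, task)
        return task

    def finish_task(self, task_id):
        self.in_flight.pop(task_id)

    def on_trainer_lost(self, trainer_id):
        # Re-queue tasks owned by a dead trainer so scaling down
        # (or a crash) does not lose data shards.
        for tid, (owner, task) in list(self.in_flight.items()):
            if owner == trainer_id:
                self.in_flight.pop(tid)
                self.todo.put(task)


m = Master(["part-0", "part-1"])
first = m.get_task(trainer_id=7)
m.on_trainer_lost(7)                 # its task goes back to the queue
recovered = m.get_task(trainer_id=8)  # trainer 8 picks up remaining work
```

In the real design, `in_flight` and `todo` would live in etcd so a restarted master can recover them, which is what the etcd operator and CRD items above provide on Kubernetes.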

Support different communication library

  • gRPC performance enhancement
  • OpenMPI with RDMA and GPU direct
  • NCCL2 with multiple nodes
  • follow up on bRPC
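One common gRPC performance technique relevant to the first item is splitting a large serialized tensor into fixed-size chunks so each RPC message stays small and chunks can be streamed and pipelined. A plain-Python sketch of the framing logic (the 4 MiB chunk size and helper names are illustrative, not Fluid's actual wire format):

```python
def chunk_tensor(buf: bytes, chunk_size: int = 4 * 1024 * 1024):
    """Split a serialized tensor into fixed-size chunks so each gRPC
    message stays under the message-size limit and transfers overlap."""
    for offset in range(0, len(buf), chunk_size):
        yield buf[offset:offset + chunk_size]


def reassemble(chunks):
    # Receiver side: concatenate chunks back into the original buffer.
    return b"".join(chunks)


payload = bytes(10 * 1024 * 1024)        # a 10 MiB "tensor"
chunks = list(chunk_tensor(payload))     # 4 MiB + 4 MiB + 2 MiB
assert reassemble(chunks) == payload
```

With gRPC streaming, each chunk would be one message on a client-streaming RPC; serialization of one chunk can then overlap with transmission of the previous one.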

Experiment

  • measure how different distributed training strategies (sync, async, etc.) affect model accuracy and throughput
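The accuracy/throughput trade-off in that experiment comes from how updates are applied. A toy scalar-weight sketch of the two update rules (illustrative only, not Fluid's implementation): synchronous SGD averages all trainers' gradients into one update per step, while asynchronous SGD applies each gradient as it arrives, possibly computed against a stale parameter value.

```python
def sync_step(w, grads, lr=0.1):
    # Synchronous SGD: barrier, average gradients, single update.
    # Equivalent to one large-batch step; accuracy matches single-node.
    return w - lr * sum(grads) / len(grads)


def async_steps(w, grads, lr=0.1):
    # Asynchronous SGD: no barrier; each gradient is applied on arrival,
    # so later updates may use gradients computed from stale weights.
    for g in grads:
        w = w - lr * g
    return w


grads = [0.5, 1.5, 1.0]          # gradients from three trainers
w_sync = sync_step(1.0, grads)   # one averaged update
w_async = async_steps(1.0, grads)  # three independent updates
```

Async removes the barrier (higher throughput, no straggler stalls) but the staleness changes the effective update, which is exactly what the experiment would quantify against model accuracy.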

CE

  • automatically execute benchmark jobs on AWS and generate a report
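The report-generation half of that item could be as simple as aggregating per-run throughput samples into one summary line per job. A hedged sketch (job names, metric, and CSV layout are made up for illustration):

```python
import statistics


def make_report(runs):
    """Aggregate throughput samples per benchmark job into a CSV
    summary, e.g. for a nightly continuous-evaluation report."""
    lines = ["job,mean_imgs_per_sec,stdev"]
    for job, samples in sorted(runs.items()):
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
        lines.append(f"{job},{mean:.1f},{stdev:.1f}")
    return "\n".join(lines)


# Hypothetical results from two benchmark jobs, two runs each.
runs = {"resnet50_2x8": [210.0, 214.0], "vgg16_2x8": [98.0, 102.0]}
report = make_report(runs)
print(report)
```

Reporting a variance measure alongside the mean matters here: it lets the CE job flag noisy runs instead of mistaking cloud jitter for a regression.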

Future

  • characterize the differences between multi-machine-single-device and multi-machine-multi-device training
  • better integration with single-machine training
  • design more flexible, user-customized device placement for multi-machine training
  • discuss whether we need a remote executor
