Multi-GPU, multi-node development milestones. #5958

@helinwang

  1. Single node, multiple CPU threads
    1. Executor supports multiple threads: @helinwang proposes to drive this.
    2. Transpiling: convert the user's input ProgramDesc to an ExecutionPlan that supports CPU-only sync-SGD data parallelism: @Yancey1989.
  2. Single node, multiple GPUs
    1. Transpiling: convert the user's input ProgramDesc to an ExecutionPlan that supports GPU sync-SGD data parallelism.
  3. Multiple nodes
    1. Operators for feeding data
    2. Transpiling: convert the user's input ProgramDesc to an ExecutionPlan that runs on multiple nodes.
    3. Send / Recv OP
    4. ExecutionPlan partition: partition the single ExecutionPlan into multiple ExecutionPlans, each of which runs on one node. Send / Recv OPs are added on edges that cross nodes.
  4. Fault tolerance: a single-node failure stops the training job and causes a job restart.
    1. Every executor should save its state automatically and load it upon restart.
  5. Elastic ML: the number of nodes can change without interrupting training (the training job does not stop).
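
To make the ExecutionPlan partition step (milestone 3.4) more concrete, here is a minimal sketch in plain Python. It is not PaddlePaddle code: `Op`, its `node` placement field, and `partition` are hypothetical names chosen for illustration, and the plan is assumed to be a list of ops in topological order. The sketch splits one plan into per-node plans and adds a send/recv pair on each edge that crosses nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    # Hypothetical stand-in for an operator in an ExecutionPlan.
    name: str
    node: int                                    # assumed: index of the node the op is placed on
    inputs: list = field(default_factory=list)   # names of ops whose output this op consumes


def partition(ops):
    """Split one plan (ops in topological order) into one plan per node.

    For every edge whose producer and consumer live on different nodes,
    a send op is appended to the producer's plan and a recv op is placed
    in the consumer's plan before the consumer.
    """
    placement = {op.name: op.node for op in ops}
    plans = {}      # node index -> list of Op
    sent = set()    # (producer name, destination node) pairs already wired up
    for op in ops:
        plan = plans.setdefault(op.node, [])
        new_inputs = []
        for src in op.inputs:
            src_node = placement[src]
            if src_node == op.node:
                new_inputs.append(src)           # local edge, nothing to add
                continue
            send_name = f"send({src}->{op.node})"
            recv_name = f"recv({src}<-{src_node})"
            if (src, op.node) not in sent:       # send each output to a node only once
                plans.setdefault(src_node, []).append(Op(send_name, src_node, [src]))
                plan.append(Op(recv_name, op.node))
                sent.add((src, op.node))
            new_inputs.append(recv_name)         # consumer now reads from the recv op
        plan.append(Op(op.name, op.node, new_inputs))
    return plans


# Usage: a four-op plan placed on two nodes.
ops = [
    Op("feed", 0),
    Op("fc", 0, ["feed"]),
    Op("loss", 1, ["fc"]),
    Op("sgd", 1, ["loss"]),
]
plans = partition(ops)
print([op.name for op in plans[0]])  # ['feed', 'fc', 'send(fc->1)']
print([op.name for op in plans[1]])  # ['recv(fc<-0)', 'loss', 'sgd']
```

In this sketch the placement of each op is given as input. How ops get assigned to nodes is a separate decision that the milestone list leaves open.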

Please comment if you have questions or suggestions, thanks!
