Multi-GPU, multi-node development milestones.

1. Single node, multiple CPU threads
    1. Executor supports multiple threads: @helinwang proposes to drive this.
    1. Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that supports CPU-only sync-SGD data parallelism: @Yancey1989.
1. Single node, multiple GPUs
    1. Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that supports GPU sync-SGD data parallelism.
1. Multiple nodes
    1. Operators for feeding data
    1. Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that runs on multiple nodes.
    1. Send / Recv OP
    1. `ExecutionPlan` partition: partition the single `ExecutionPlan` into multiple `ExecutionPlan`s, each of which runs on one node. Send / Recv OPs are added on edges that cross nodes.
1. Fault tolerance: a single node failure stops the training job and triggers a job restart.
    1. Every executor should save its state automatically and load it upon restart.
1. Elastic ML: the number of nodes can change without interrupting training (the training job does not stop).
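
To make the `ExecutionPlan` partition step more concrete, here is a minimal sketch in plain Python. It is not the actual executor or transpiler API: the tuple-based op representation, the `placement` mapping, and the `partition_plan` function are all hypothetical names chosen for illustration. It only shows the idea of splitting one op list into per-node lists and adding a send / recv pair wherever a variable crosses nodes.

```python
from collections import defaultdict


def partition_plan(ops, placement):
    """Split one op list into per-node op lists (hypothetical representation).

    ops: list of (name, inputs, outputs) tuples in execution order.
    placement: dict mapping op name -> node id (assumed to be given).
    Returns a dict mapping node id -> list of (name, inputs, outputs, attrs).
    """
    producer_node = {}         # variable name -> node that produces it
    plans = defaultdict(list)  # node id -> ops assigned to that node
    delivered = set()          # (variable, destination node) already transferred

    for name, inputs, outputs in ops:
        node = placement[name]
        for var in inputs:
            src = producer_node.get(var)
            # A variable produced on a different node is an edge that crosses nodes.
            if src is not None and src != node and (var, node) not in delivered:
                plans[src].append(("send", [var], [], {"to": node}))
                plans[node].append(("recv", [], [var], {"from": src}))
                delivered.add((var, node))
        plans[node].append((name, list(inputs), list(outputs), {}))
        for var in outputs:
            producer_node[var] = node
    return dict(plans)


# Toy plan: the first two ops run on node 0, the last one on node 1.
ops = [
    ("mul", ["x", "w"], ["xw"]),
    ("add", ["xw", "b"], ["y"]),
    ("softmax", ["y"], ["out"]),
]
placement = {"mul": 0, "add": 0, "softmax": 1}
plans = partition_plan(ops, placement)
```

In this toy plan only `y` crosses nodes, so node 0 gets a `send` appended after `add` and node 1 gets a `recv` ahead of `softmax`. How ops are actually assigned to nodes is left open here and would be decided by the transpiler.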

Please comment if you have questions or suggestions, thanks!

Multi-GPU, multi-node development milestones. #5958

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-GPU, multi-node development milestones. #5958

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions