Closed
Description
- Single node multiple CPU threads
- Executor supports multiple threads: @helinwang proposes to drive this.
- Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that supports CPU-only sync-SGD data parallelism: @Yancey1989.
- Single node multiple GPUs
- Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that supports GPU sync-SGD data parallelism.
- Multiple nodes
- operators for feeding data
- Transpiling: convert the user's input `ProgramDesc` to an `ExecutionPlan` that runs on multiple nodes.
- Send / Recv OP
- `ExecutionPlan` partition: partition the single `ExecutionPlan` into multiple `ExecutionPlan`s, each of which runs on one node. Send / Recv OPs are added on edges that cross nodes.
- Fault tolerant: single node failure stops the training job and causes a job restart.
- Every executor should save its state automatically and load it upon restart.
- Elastic ML: number of nodes can change without interrupting the training (training job will not stop).
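To make the `ExecutionPlan` partition item more concrete, here is a minimal sketch in plain Python. It is not PaddlePaddle code: the `Op` record, the `partition` function, and the `placement` mapping are hypothetical names introduced for illustration, and the plan is assumed to be a topologically ordered list of ops.

```python
from collections import namedtuple

# Hypothetical op record: name, input variable names, output variable names.
Op = namedtuple("Op", ["name", "inputs", "outputs"])


def partition(plan, placement):
    """Split one plan (a topologically ordered list of Ops) into per-node plans.

    placement maps op name -> node id. Whenever a variable is produced on one
    node and consumed on another, a send op is appended to the producer's plan
    and a recv op is placed before the consumer in its plan.
    """
    plans = {node: [] for node in set(placement.values())}
    producer_node = {}  # variable name -> node that produced it
    delivered = set()   # (variable, destination node) pairs already wired

    for op in plan:
        node = placement[op.name]
        for var in op.inputs:
            src = producer_node.get(var)
            if src is not None and src != node and (var, node) not in delivered:
                plans[src].append(Op("send", [var], []))
                plans[node].append(Op("recv", [], [var]))
                delivered.add((var, node))
        plans[node].append(op)
        for var in op.outputs:
            producer_node[var] = node
    return plans


plan = [
    Op("mul", ["x", "w"], ["h"]),
    Op("relu", ["h"], ["a"]),
    Op("softmax", ["a"], ["y"]),
]
placement = {"mul": 0, "relu": 0, "softmax": 1}
plans = partition(plan, placement)
for node in sorted(plans):
    print(node, [op.name for op in plans[node]])
```

In this toy example only the edge carrying `a` crosses nodes, so node 0 ends with a `send` and node 1 starts with a `recv`. A real partitioner would also need to decide the placement itself, which this sketch takes as given.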
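For readers less familiar with the sync-SGD data parallelism targeted by the transpiling items, the following is a small sketch of one synchronous step in plain Python. The function names `grad` and `sync_sgd_step`, the 1-D linear model, and the assumption that the batch size is divisible by the number of workers are all illustrative choices, not part of the proposal.

```python
def grad(w, shard):
    """Gradient of the mean squared error for y = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_sgd_step(w, batch, num_workers, lr):
    """One synchronous step: each worker gets an equal shard of the batch,
    the per-worker gradients are averaged, and one update is applied."""
    shard_size = len(batch) // num_workers  # assumes an evenly divisible batch
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    # On real hardware each gradient would be computed on its own thread or GPU.
    grads = [grad(w, shard) for shard in shards]
    return w - lr * sum(grads) / num_workers


data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, data, num_workers=2, lr=0.01)
print(round(w, 3))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the update is the same as a single-worker step on the whole batch. That equivalence is what lets the transpiler parallelize without changing the training result.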
Please comment if you have questions or suggestions, thanks!