How to implement DataParallelEngine #2749

Closed
@QiJune

Description

We should support running a Net on multiple GPUs: users just define a Net and set GPU ids, and parallel execution across the GPUs happens automatically.

In caffe2, NCCL and gloo are used to support multiple GPUs across multiple servers, and the operations in both NCCL and gloo are represented as Operators.

In Paddle, we have already implemented MultiGradientMachine and pserver. We may use NCCL to merge gradients across multiple GPUs in the new version. Should we also represent NCCL operations as Operators?

If an NCCL operation is an Operator, then one Net may correspond to multiple GPUs. Alternatively, if we treat an NCCL operation as a plain function, then one Net corresponds to one GPU.
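To make the trade-off concrete, here is a minimal sketch of the two designs. All names (`AllReduceOp`, `all_reduce_fn`) are hypothetical illustrations, not actual Paddle or NCCL APIs, and plain Python lists stand in for per-GPU gradient buffers; a real implementation would call NCCL's allreduce on device memory.

```python
class AllReduceOp:
    """Option A: the NCCL all-reduce is itself an Operator inside the Net,
    so a single Net spans all GPUs and owns every replica's gradients."""

    def __init__(self, grads):
        # grads: one gradient buffer (list of floats) per GPU replica.
        self.grads = grads

    def run(self):
        n = len(self.grads)
        # Average element-wise across replicas, then broadcast back.
        merged = [sum(vals) / n for vals in zip(*self.grads)]
        self.grads = [list(merged) for _ in range(n)]


def all_reduce_fn(grads):
    """Option B: the NCCL call is a plain function outside the Net,
    so each Net maps to one GPU and a coordinator merges gradients."""
    n = len(grads)
    merged = [sum(vals) / n for vals in zip(*grads)]
    return [list(merged) for _ in grads]


# Four replicas holding different gradients for the same parameter.
grads = [[float(i)] * 3 for i in range(4)]

op = AllReduceOp([list(g) for g in grads])
op.run()
print(op.grads[0])          # every replica now holds the averaged gradient

print(all_reduce_fn(grads)[0])
```

Either path computes the same merged gradient; the difference is where the cross-device communication lives: inside the Net's operator graph (Option A) or in the engine that schedules one Net per GPU (Option B).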
