Description
When launching a kernel in CUDA, we have to specify a CUDA stream. We have implemented DeviceContext, which holds a CUDA stream. Copy jobs and operator run jobs both launch CUDA kernels, so a DeviceContext has to be passed to each of them.
On the user side, users only specify the CPU or several GPU ids to run network training; they are not aware of DeviceContext. So we need to implement a DeviceContextManager that initializes the DeviceContexts up front and schedules the CUDA kernels.
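To make the proposal concrete, here is a minimal sketch of what such a manager could look like. It is not the actual Paddle implementation: `StreamHandle`, the `Place` layout, the `Get` method, and the constructor taking a list of places are all hypothetical, and an integer stands in for `cudaStream_t` so the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for cudaStream_t (assumption, keeps the sketch CUDA-free).
using StreamHandle = int;

// Hypothetical Place: device_id < 0 means CPU, otherwise a GPU id.
struct Place {
  int device_id;
  bool is_gpu() const { return device_id >= 0; }
};

// Hypothetical DeviceContext: one stream bound to one place.
class DeviceContext {
 public:
  DeviceContext(Place place, StreamHandle stream)
      : place_(place), stream_(stream) {}
  Place place() const { return place_; }
  StreamHandle stream() const { return stream_; }

 private:
  Place place_;
  StreamHandle stream_;
};

// Creates one context per requested place up front and hands them out on demand.
class DeviceContextManager {
 public:
  explicit DeviceContextManager(const std::vector<Place>& places) {
    StreamHandle next_stream = 1;  // 0 means "no stream" (CPU) in this sketch
    for (const Place& p : places) {
      // A real implementation would presumably create the stream with
      // cudaStreamCreate here; the sketch just hands out increasing ids.
      StreamHandle s = p.is_gpu() ? next_stream++ : 0;
      contexts_.emplace(p.device_id, std::make_unique<DeviceContext>(p, s));
    }
  }

  // Returns the context created for `place`; the place must be registered.
  const DeviceContext& Get(Place place) const {
    auto it = contexts_.find(place.device_id);
    assert(it != contexts_.end() && "place was not registered at start-up");
    return *it->second;
  }

 private:
  std::map<int, std::unique_ptr<DeviceContext>> contexts_;
};
```

With this shape, the user-facing code only ever names places, and everything that needs a stream asks the manager for the context of that place.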
OperatorBase may also need a second Run interface:
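class OperatorBase {
 public:
  // User-facing: the caller only names a place (CPU or a GPU id).
  void Run(Place p);
  // Internal: runs with an explicit scope and device context.
  void Run(Scope, DeviceContext);
};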
The first Run interface is for users: they only need to set a place, and Paddle passes the DeviceContext to the second Run interface.
The same applies to copy jobs: a specific CUDA stream is passed to each copy job, and the job is scheduled by DeviceContextManager.
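The delegation described above could be sketched roughly as follows. This is an illustration under several assumptions, not the proposed API itself: the operator holding pointers to a `Scope` and a manager, the `Register`/`Get` methods, `RecordingOp`, the free `Copy` functions, and the integer `StreamHandle` (a stand-in for `cudaStream_t`) are all hypothetical. Parameters are taken by const reference here rather than by value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>

using StreamHandle = int;  // hypothetical stand-in for cudaStream_t

struct Place { int device_id; };  // hypothetical: -1 = CPU, >= 0 = GPU id
struct Scope {};                  // hypothetical: variable storage, empty here
struct DeviceContext { Place place; StreamHandle stream; };

// Hypothetical minimal manager: maps a place to its pre-built context.
class DeviceContextManager {
 public:
  void Register(const DeviceContext& ctx) {
    contexts_[ctx.place.device_id] = ctx;
  }
  // Throws std::out_of_range if the place was never registered.
  const DeviceContext& Get(Place p) const { return contexts_.at(p.device_id); }

 private:
  std::map<int, DeviceContext> contexts_;
};

class OperatorBase {
 public:
  // Assumption: the operator is given its scope and the manager at construction.
  OperatorBase(const Scope* scope, const DeviceContextManager* manager)
      : scope_(scope), manager_(manager) {}
  virtual ~OperatorBase() = default;

  // User-facing: look up the context for the place, then delegate.
  void Run(Place p) { Run(*scope_, manager_->Get(p)); }

  // Internal: concrete operators launch their kernel on ctx.stream.
  virtual void Run(const Scope& scope, const DeviceContext& ctx) = 0;

 private:
  const Scope* scope_;
  const DeviceContextManager* manager_;
};

// Hypothetical operator that only records which stream it was given.
class RecordingOp : public OperatorBase {
 public:
  using OperatorBase::OperatorBase;
  using OperatorBase::Run;  // keep Run(Place) visible next to the override
  void Run(const Scope&, const DeviceContext& ctx) override {
    last_stream = ctx.stream;
  }
  StreamHandle last_stream = -1;
};

// Internal copy: a real version would presumably call cudaMemcpyAsync on
// ctx.stream; the sketch copies on the host and returns the stream used.
StreamHandle Copy(void* dst, const void* src, std::size_t n,
                  const DeviceContext& ctx) {
  std::memcpy(dst, src, n);
  return ctx.stream;
}

// User-facing copy: mirrors Run(Place) by resolving the context first.
StreamHandle Copy(void* dst, const void* src, std::size_t n, Place p,
                  const DeviceContextManager& manager) {
  return Copy(dst, src, n, manager.Get(p));
}
```

Both entry points follow the same pattern: the place-based overload resolves a context through the manager and forwards to the context-based overload, so stream selection stays in one place.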