add DeviceContext design doc #2648
Conversation
doc/design/context.md
```c++
~DeviceGuard() noexcept {
  cudaError_t err = cudaSetDevice(previous_);
  PADDLE_ASSERT(err == cudaSuccess);
```
This is obviously not `noexcept`.
Done
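For context on the fix discussed above, here is a minimal sketch of a device guard whose destructor really is `noexcept`: it restores the previous device and swallows rather than propagates failure. The device API is stubbed out with a hypothetical `set_device` so the example is CUDA-free; none of these names are from the design doc itself.

```cpp
#include <cassert>

// Hypothetical stand-in for cudaSetDevice: returns 0 on success.
static int g_current_device = 0;
inline int set_device(int id) {
  if (id < 0) return 1;  // error code
  g_current_device = id;
  return 0;
}

// RAII guard: switches device on construction, restores on destruction.
// The destructor stays honestly noexcept by not throwing on failure.
class DeviceGuard {
 public:
  explicit DeviceGuard(int new_device) : previous_(g_current_device) {
    set_device(new_device);
  }
  ~DeviceGuard() noexcept {
    // Restoring a previously valid device is expected to succeed; if it
    // ever fails, we must not throw from a destructor, so we ignore it.
    (void)set_device(previous_);
  }

 private:
  int previous_;
};

inline int guarded_switch_demo() {
  g_current_device = 0;
  {
    DeviceGuard guard(3);  // switch to device 3
    assert(g_current_device == 3);
  }  // guard restores device 0 here
  return g_current_device;
}
```

An assertion-based variant (as in the quoted snippet) aborts instead of throwing, which is also `noexcept`-safe; the key point is that no exception escapes the destructor.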
doc/design/context.md
> The future DAGNet is run by multi-threads. And each thread will have its own Eigen::GpuDevice object binding on different CudaStream. Multi-threads can run parallelly on a same GPU card.
>
> And Copy(Communication) work will be in charge of specific thread. The copy thread will only get CudaStream from corresponding Context.
I don't quite understand this line. Does it mean "a specific thread is in charge of the Copy (Communication) work"? And what is "Copy (Communication)" work?
Yes, "a specific thread is in charge of the Copy (Communication) work" is right. It means the data copy between CPU/GPU or between GPUs.
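To illustrate the idea of a dedicated copy thread, here is a hedged sketch: one worker thread drains a queue of copy jobs while other threads keep computing. In the real design the worker would issue `cudaMemcpyAsync` on its own stream from its Context; here the copy is simulated with `memcpy`, and all names (`CopyWorker`, `CopyJob`) are hypothetical.

```cpp
#include <condition_variable>
#include <cstring>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A copy job: in reality this would wrap cudaMemcpyAsync on the copy
// thread's own CUDA stream; here we simulate it with memcpy.
struct CopyJob {
  const char* src;
  char* dst;
  size_t n;
};

class CopyWorker {
 public:
  CopyWorker() : done_(false), worker_([this] { Loop(); }) {}
  ~CopyWorker() {  // drains remaining jobs, then joins
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }
  void Enqueue(CopyJob job) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      jobs_.push(job);
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    std::unique_lock<std::mutex> lk(mu_);
    for (;;) {
      cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
      if (jobs_.empty()) {
        if (done_) return;
        continue;
      }
      CopyJob job = jobs_.front();
      jobs_.pop();
      lk.unlock();
      std::memcpy(job.dst, job.src, job.n);  // stand-in for the async copy
      lk.lock();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<CopyJob> jobs_;
  bool done_;
  std::thread worker_;
};

inline std::string copy_demo() {
  const char msg[] = "hello";
  std::vector<char> dst(sizeof(msg), '\0');
  {
    CopyWorker worker;
    worker.Enqueue({msg, dst.data(), sizeof(msg)});
  }  // destructor drains the queue before returning
  return std::string(dst.data());
}
```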
doc/design/context.md
> ## Context Design
>
> A Net is executed by single or several threads. A Context is related to a thread and records necessary runtime resources.
There is no explanation here of why we need a Context; it should point out the relationship between Context and Operator.
Done
doc/design/context.md
```c++
  return blas_handle_;
}

cudnnHandle_t GetDnnHandle() {
```
Maybe `GetCUDNNHandle` is clearer.
Done
doc/design/context.md
```c++
~CUDAContext() {
  Wait();
  cudaError_t err = cudaStreamDestroy(stream_);
```
Maybe the destruction of the stream should be placed at the end, because it is the first thing to be constructed.
Done
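The ordering point above follows directly from C++ itself: members are destroyed in reverse order of declaration, so a stream declared first is constructed first and destroyed last. A minimal sketch with an order log instead of CUDA resources (all names here are hypothetical):

```cpp
#include <string>
#include <vector>

static std::vector<std::string>* g_log = nullptr;

struct Resource {
  explicit Resource(std::string name) : name_(std::move(name)) {
    if (g_log) g_log->push_back("create " + name_);
  }
  ~Resource() {
    if (g_log) g_log->push_back("destroy " + name_);
  }
  std::string name_;
};

// Declaration order mirrors the intended CUDAContext layout: the stream
// is declared first, so it is constructed first and destroyed last —
// after the handles that depend on it.
struct FakeCUDAContext {
  Resource stream{"stream"};
  Resource blas_handle{"blas"};
  Resource dnn_handle{"dnn"};
};

inline std::vector<std::string> order_demo() {
  std::vector<std::string> log;
  g_log = &log;
  { FakeCUDAContext ctx; }  // scope exit destroys dnn, blas, then stream
  g_log = nullptr;
  return log;
}
```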
doc/design/context.md
> ## Context Design
>
> A Net is executed by single or several threads. A Context is related to a thread and records necessary runtime resources.
`records` -> `holds`? `records` sounds weird here.
Done
doc/design/context.md
> Context is defined as follows:

```c++
class Context {};
```
`struct Context` may be better. The `Context` is a lightweight data structure: users access its member data without needing rich member functions, so a `struct` is enough. Reference: `mxnet::Context`.
The `Context` defined here is more like `RunContext` in mxnet.
doc/design/context.md
> ### CUDAContext
>
> Because the Tensor computation are executed by Eigen library, which needs an Eigen::GpuDevice type object as parameter. And the GpuDevice parameter is constructed with an Eigen::CudaStreamDevice object. We need to set a specific GpuID and CudaStream to create a Eigen::CudaStream object.
Suggested fixes:

- `as parameter` -> `as a parameter`
- `. And` -> `, and`
- `. We` -> `, we`
- `We need to set a specific GpuID and CudaStream` -> `We need to set both a specific GpuID and a CudaStream to create an Eigen::CudaStream object.`
Done
doc/design/context.md
> At the same time, some computation work will executed by cublas or cudnn library. Take cublas library as an example, we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as Eigen library does.
>
> The future DAGNet is run by multi-threads. And each thread will have its own Eigen::GpuDevice object binding on different CudaStream. Multi-threads can run parallelly on a same GPU card.
Suggested fixes:

- all `. And` -> `, and`
- `on different CudaStream. Multi-threads can run parallelly on a same GPU card.` -> `on different CudaStream, **so that** multi-threads can run parallelly on the same GPU card.`
Done
doc/design/context.md
> - Differnet GPU cards have different GpuID, and we can do data parallelism on multi-GPUs.
> - Multi-threads can run a Net parallelly on a single GPU card, and each thread has one Context.
> - There is also single thread executing a Net sequentially. All computation and communication work will use same Context.
`use same` -> `use the same`
Done
doc/design/context.md
> Context is defined as follows:

```c++
class Context {};
```
Are there no common methods shared between `GpuContext` and `CpuContext`?

```c++
struct Context {
  int dev_id{0};  // CPU = 0
  enum DevType {
    kCPU,
    kGPU,
  };
  DevType dev_type;
  enum StreamType {
    kCUDNN,
    kBLAS,
    kCUDA,
  };
  // all the streams are created globally, so a void* is enough
  // idx is the index of the stream of `type`; each thread in a device can have one idx
  void* GetStream(StreamType type, int stream_idx);
  // allocate one stream for each StreamType and thread, in all the devices
  static void** streams;
};
```

Both `CpuContext` and `GpuContext` are `Context`s, so it's weird that they have no similarity. We could merge them into one `Context`; do we really need two? If `CpuContext` and `GpuContext` will be used as template parameters, can `Place` replace their role?
The `Context` is just for unifying the two classes `CUDAContext` and `CPUContext`. And `RunContext` in mxnet is also for unifying the two classes `Stream<gpu>` and `Stream<cpu>`.
doc/design/context.md
> At the same time, some computation work will executed by cuBLAS or cuDNN library. Take cuBLAS library as an example, we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as Eigen library does.
>
> A `Context` is corresponded to a thread and holds runtime resources, such as CudaStream, cublasHandle and so on. And the `Run` method of `Operator` will take `Context` from a specific thread as a parameter to get necessary runtime resources.
Why must a `Context` correspond to a thread?
The `Context` here is actually a thread context. The Net is executed by a single thread or several threads, and each thread will have its own runtime resources. Maybe `ThreadContext` is a better name.
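One hedged way to realize "one Context per thread" in C++ is a `thread_local` slot, so each executing thread transparently gets its own instance. The names below (`ThreadContext`, `GetThreadContext`) are illustrative, not from the design doc:

```cpp
#include <thread>

// Hypothetical per-thread context: each thread gets its own instance,
// so runtime resources (streams, handles) are never shared by accident.
struct ThreadContext {
  int stream_id = -1;
};

inline ThreadContext& GetThreadContext() {
  thread_local ThreadContext ctx;  // one instance per thread
  return ctx;
}

inline bool per_thread_demo() {
  GetThreadContext().stream_id = 1;  // main thread's context
  int other = -2;
  std::thread t([&other] {
    // a fresh context: the main thread's assignment is not visible here
    other = GetThreadContext().stream_id;
  });
  t.join();
  return GetThreadContext().stream_id == 1 && other == -1;
}
```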
doc/design/context.md
> Context is defined as follows(just using for unify class CUDAContext and class CPUContext):

```c++
class Context {};
```
So, how do we use `CPUContext` or `GPUContext` from a `Context*`? If we want to use `dynamic_cast<GPUContext*>(context)`, `Context` should have a virtual destructor:

```c++
struct Context {
  virtual ~Context() {}
};
```
Yes, you are right. I will fix it later.
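The requirement discussed above is a plain C++ rule: `dynamic_cast` needs a polymorphic base (at least one virtual function), and a virtual destructor also makes deletion through `Context*` safe. A minimal sketch, with a hypothetical `device_id` member added for illustration:

```cpp
// The virtual destructor makes Context polymorphic (enabling
// dynamic_cast) and makes deleting derived objects through a
// Context* destroy the derived part too.
struct Context {
  virtual ~Context() {}
};

struct CPUContext : public Context {};

struct GPUContext : public Context {
  int device_id = 0;  // hypothetical member for the demo
};

inline int casting_demo() {
  GPUContext gpu;
  gpu.device_id = 2;
  Context* ctx = &gpu;
  // Succeeds because ctx really points at a GPUContext...
  GPUContext* as_gpu = dynamic_cast<GPUContext*>(ctx);
  // ...and yields nullptr for the wrong derived type.
  CPUContext* as_cpu = dynamic_cast<CPUContext*>(ctx);
  return (as_gpu != nullptr && as_cpu == nullptr) ? as_gpu->device_id : -1;
}
```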
doc/design/context.md
```c++
class DeviceGuard {
 public:
  explicit DeviceGuard(int newDevice)
```
Should `DeviceGuard` take `Place` as the argument? Like:

```c++
class GPUPlaceGuard {
 public:
  explicit GPUPlaceGuard(GPUPlace place);
};
```

It could make our design consistent.
Done
Thanks for this PR.
I am wondering about one basic question: could we restrict a net to run on a single device, and make use of multiple devices by introducing a class `DataParallelEngine`? It seems that if we do so, we can simplify a lot of things, at least for right now.
> `Net` is the container and controller of a set of `Operator`. Each `Operator` in `Net` has a method called `Run` to make computation. The `Run` method of `Operator` is defined as follows:

```c++
Error Operator::Run(OpContext* context);
```
What kind of errors could `Operator::Run` return? What actions could the caller take in response to these errors?
I ask because I always tend to handle errors inside the function instead of leaving them to the caller, who has less information than the callee does and usually cannot do much with the returned errors.
The `Run` interface of `Operator` has changed and does not return an error any more.
Got it. I will fix it later.
> At the same time, some computation work will executed by cuBLAS or cuDNN library. Take cuBLAS library as an example, we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as Eigen library does.
>
> `DeviceContext` is defined as follows(just using for unify class `CudaDeviceContext` and class `CpuDeviceContext`):
If DeviceContext has multiple sub-classes, it implies that these sub-classes share some common behaviors (or methods). So it looks strange that the following definition of DeviceContext doesn't have virtual methods other than the destructor.
Yes, we need a stream, cublasHandle, and cudnnHandle on GPU, but nearly nothing on CPU. So `CPUDeviceContext` and `CUDADeviceContext` can hardly have anything in common. The `DeviceContext` is just used to unify the two types and make it convenient to pass a parameter to `Operator`.
In caffe2, except for memory `New` and `Delete`, `CPUContext` and `CUDAContext` have nearly no methods in common. (https://github.com/caffe2/caffe2/blob/master/caffe2/core/context.h#L124)
In mxnet, `Stream<cpu>` and `Stream<gpu>` have nearly no methods in common either. The Stream in mxnet holds runtime resources. (https://github.com/dmlc/mshadow/blob/20b54f068c1035f0319fa5e5bbfb129c450a5256/mshadow/tensor.h#L370)
```c++
}

~DeviceGuard() {
  cudaError_t err = cudaSetDevice(previous_.device);
```
It seems that we can write:

```c++
PADDLE_ENFORCE(cudaSetDevice(previous_.device));
```
Yes, I have fixed it in PR #2709.
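For readers unfamiliar with the pattern: an ENFORCE-style macro turns a status-code check into an exception at the call site, so the call and the check collapse into one line. This is only a sketch of the idea under a hypothetical `MY_ENFORCE` name and a fake status type, not PaddlePaddle's actual `PADDLE_ENFORCE`:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical status code standing in for cudaError_t: 0 == success.
using status_t = int;

// Sketch of an ENFORCE-style macro: evaluate the call once, and throw
// with the stringified expression plus file/line context on failure.
#define MY_ENFORCE(expr)                                            \
  do {                                                              \
    status_t _st = (expr);                                          \
    if (_st != 0) {                                                 \
      throw std::runtime_error(std::string(#expr) + " failed at "   \
                               __FILE__ ":" +                       \
                               std::to_string(__LINE__));           \
    }                                                               \
  } while (0)

// Fake device-setting call for the demo: fails on negative ids.
inline status_t fake_set_device(int id) { return id >= 0 ? 0 : 1; }

inline bool enforce_demo() {
  MY_ENFORCE(fake_set_device(1));  // success: no throw
  try {
    MY_ENFORCE(fake_set_device(-1));  // failure: throws
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```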
```c++
  eigen_handle_ = new Eigen::GpuDevice(eigen_stream_);
}

void Wait() {
```
This two-line function is only called by the destructor; how about moving these two lines into the destructor and saving this function?
The `Wait` method synchronizes a CUDA stream, and it may also be called in other circumstances.
```c++
  GPUPlace previous_;
};

class CudaDeviceContext : public DeviceContext {
```
Cuda => CUDA. It is an acronym.
> ### CpuDeviceContext
>
> CpuDeviceContext is defined as follows:
Cpu => CPU
```c++
};
```

> The `Run` method will take `Variable` (containing `Tensor`) from `Scope` to make computation on certain `DeviceContext`. `DeviceContext` provides necessary runtime resources for computation, including CudaStream, cublasHandle and so on.
Is it something like `Operator::Run` can call `Operator::Input(i)` to get the i-th input, which should have been created by the feeding operator or some other one before the invocation of `Operator::Run`?
If so, it seems that `Operator::Run` has to run on the same device as where the input resides; otherwise, there would be unnecessary data copying from where the input is to where `Run` executes.
Given that we don't want such inefficient copying, we should make sure that all operators of a network run on the same device, and `Operator::Run` should take a constant reference to the context, like

```c++
void Operator::Run(const Context& ctx);
```

and `Context` should refer to a single specific device (like a GPU) rather than a set of GPUs.
I think we can do so, actually, by keeping `class Net` a simple lightweight class without sub-classes. If we want to make use of multiple GPUs, we can create a `class DataParallelEngine` which runs multiple clones of a net, each on a GPU and by a thread, and aggregates gradient tensors of model parameters. A simpler engine could be a `class SingleThreadEngine` which runs a net on a specific device using one thread.
At first, you are right: `Operator::Run` will run on the same device.
However, a `Context` which binds to a specific GPU card is not enough for efficient Net execution. `Operator::Run` usually binds to a specific CUDA stream on a specific GPU card.
If we just want to implement a `SimpleNet`, in which all the operators execute sequentially, only one CUDA stream is enough.
But if we also want to implement a `DAGNet`, in which the operators can execute in parallel, we have to create several CUDA streams on a GPU card.
So we may have a single thread or several threads executing a `Net`, and each thread will hold its own CUDA stream.
There are two ways to implement `DAGNet`:

- We pass a `Context` parameter to the `Operator`, and the `Context` must have several CUDA streams:

  ```c++
  struct Context {
    int device_id;
    std::vector<cudaStream_t> streams;
  };

  void Operator::Run(const Context& ctx);
  ```

  And the `Operator` has to choose one CUDA stream to run on.

- We pass an `OpContext` to the `Operator`:

  ```c++
  struct OpContext {
    int device_id;
    cudaStream_t stream;
  };

  void Operator::Run(OpContext& ctx);
  ```

  The `OpContext` will be passed to the `Operator` and may be created by many threads. All the threads are managed and scheduled by the engine.
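The difference between the two options can be sketched without CUDA: in the first, the operator picks a stream out of a shared `Context`; in the second, the scheduler pre-selects one stream per `OpContext` before calling `Run`. Streams are faked as integers and all names are hypothetical:

```cpp
#include <vector>

using fake_stream_t = int;  // stand-in for cudaStream_t

// Option 1: a shared Context carries all streams; the operator picks one.
struct Context {
  int device_id;
  std::vector<fake_stream_t> streams;
};

inline fake_stream_t RunOnContext(const Context& ctx, int op_index) {
  // e.g. round-robin selection done inside the operator itself
  return ctx.streams[op_index % ctx.streams.size()];
}

// Option 2: the scheduler resolves the stream before calling Run.
struct OpContext {
  int device_id;
  fake_stream_t stream;
};

inline fake_stream_t RunOnOpContext(const OpContext& ctx) {
  return ctx.stream;  // the operator just uses what it was handed
}

inline bool stream_choice_demo() {
  Context shared{0, {10, 11, 12}};
  // The scheduler picks stream index 4 % 3 == 1 for this operator.
  OpContext per_op{0, RunOnContext(shared, 4)};
  return RunOnOpContext(per_op) == 11;
}
```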
There are two ways to get the input when running an operator:

- Caffe uses `Operator::Input(i)`
- TensorFlow uses:

  ```c++
  void Compute(OpKernelContext* context) override {
    const Tensor& input = context->input(0);
    const Tensor& bias = context->input(1);
  }
  ```
My point is exactly that we might not need DAGNet.
So, we may implement a `DataParallelEngine` which runs on multiple GPUs. And I think that on a single GPU card, we also have to implement both a `SingleThreadEngine` and a `DAGEngine`.
Since a `DAGEngine` will create many threads, and each will bind to a specific CUDA stream, I suggest we take `OpContext` as the parameter of the `Operator::Run` method.
I am writing a simple engine according to mxnet::engine's design document in my spare time:
https://github.com/Superjom/NaiveEngine
It is nearly finished, and I am writing more unit tests to make sure that it works under large throughput.
Currently, a `DebugEngine` is an engine without a thread pool, and a `MultiThreadEnginePooled` with a thread pool of only one worker is exactly the case described above:

> A simpler engine could be a class `SingleThreadEngine` which runs a net on a specific device using one thread.

The code is much smaller than the original mxnet's, and the logic is simpler. Just adding a choice :-) @wangkuiyi @reyoung
For now, we may not need `DAGNet`. But the `Run` method is actually executed on a specific CUDA stream, and I think passing a CUDA stream to `Operator` is more flexible.
I'd suggest we target `SingleThreadEngine` if this discussion would take more time than we can afford. Let's assume that a net runs on a single GPU/CPU with one thread only. We can keep this design for later upgrading to more complicated engines.
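The single-thread baseline being suggested is small enough to sketch: one device context, and operators run sequentially in the order given. All names here are hypothetical, not from the design doc:

```cpp
#include <functional>
#include <vector>

struct DeviceContext {
  int device_id = 0;
};

// A single-thread engine simply runs every operator of the net in
// order, passing the same per-thread DeviceContext to each Run call.
class SingleThreadEngine {
 public:
  using Op = std::function<void(const DeviceContext&)>;
  explicit SingleThreadEngine(int device_id) { ctx_.device_id = device_id; }
  void AddOp(Op op) { net_.push_back(std::move(op)); }
  void Run() {
    for (auto& op : net_) op(ctx_);  // sequential: one thread, one context
  }

 private:
  DeviceContext ctx_;
  std::vector<Op> net_;
};

inline int engine_demo() {
  int acc = 0;
  SingleThreadEngine engine(/*device_id=*/0);
  engine.AddOp([&acc](const DeviceContext&) { acc += 1; });
  engine.AddOp([&acc](const DeviceContext&) { acc *= 10; });
  engine.Run();
  return acc;  // ops run in order: (0 + 1) * 10
}
```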
```c++
class DeviceGuard {
```
If we restrict a net to run on a single device, and `DataParallelEngine` makes use of multiple GPUs, does that mean that device switching could be defined inside of `DataParallelEngine`, or even its sub-class `MultiCUDAEngine`?
`DeviceGuard` is just used to reduce the burden on developers of remembering the current device. The `DeviceGuard` ensures we call CUDA APIs on the right GPU device; without it, the developer has to switch to the right GPU device before calling any such API.
Done.
#2607