
add DeviceContext design doc #2648

Closed
wants to merge 3 commits into from

Conversation

QiJune
Member

@QiJune QiJune commented Jun 28, 2017


```c++
~DeviceGuard() noexcept {
  cudaError_t err = cudaSetDevice(previous_);
  PADDLE_ASSERT(err == cudaSuccess);
```
Collaborator

This is obviously not noexcept.

Member Author

Done
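(For illustration: one way to reconcile the error check with `noexcept` is to log rather than assert in the destructor. A minimal sketch, not the PR's final code; the logging choice is an assumption:)

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: an RAII guard whose destructor checks the CUDA error but never
// throws, so marking it noexcept is actually safe.
class DeviceGuard {
 public:
  explicit DeviceGuard(int new_device) {
    cudaGetDevice(&previous_);
    cudaSetDevice(new_device);
  }
  ~DeviceGuard() noexcept {
    cudaError_t err = cudaSetDevice(previous_);
    if (err != cudaSuccess) {
      // Log instead of asserting/throwing inside a noexcept destructor.
      std::fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
    }
  }

 private:
  int previous_ = 0;
};
```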


The future DAGNet is run by multi-threads. And each thread will have its own Eigen::GpuDevice object binding on different CudaStream. Multi-threads can run parallelly on a same GPU card.

And Copy(Communication) work will be in charge of specific thread. The copy thread will only get CudaStream from corresponding Context.
Member

@jacquesqiao jacquesqiao Jun 28, 2017

I don't quite understand this line. Does it mean that
a specific thread is in charge of the Copy (Communication) work?
And what is Copy (Communication) work?

Member Author

Yes, a specific thread is in charge of the Copy (Communication) work.
It means the data copies between CPU and GPU, or between GPUs.
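(For illustration, a minimal sketch of such a copy task, assuming the copy thread receives a `cudaStream_t` from its Context; the function name is hypothetical:)

```c++
#include <cuda_runtime.h>

// Hypothetical copy task: the dedicated copy thread only needs the stream
// from its Context to issue asynchronous transfers between host and device.
void CopyHostToDevice(void* dst, const void* src, size_t bytes,
                      cudaStream_t stream) {
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
  // Work on other streams can proceed; block only when the result is needed:
  cudaStreamSynchronize(stream);
}
```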

## Context Design


A Net is executed by single or several threads. A Context is related to a thread and records necessary runtime resources.
Contributor

There is no explanation here of why we need a Context; the doc should point out the relationship between Context and Operator.

Member Author

Done

```c++
  return blas_handle_;
}

cudnnHandle_t GetDnnHandle() {
```
Contributor

Maybe GetCUDNNHandle is clearer.

Member Author

Done


```c++
~CUDAContext() {
  Wait();
  cudaError_t err = cudaStreamDestroy(stream_);
```
Contributor

Maybe the destruction of the stream should be placed at the end, because the stream is the first thing to be constructed.

Member Author

Done
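(For illustration, a sketch where resources are released in reverse order of creation; member names follow the snippets quoted in this thread, and the exact handle set is an assumption:)

```c++
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cudnn.h>

// Sketch: the stream is created first, so it is destroyed last.
class CUDAContext {
 public:
  CUDAContext() {
    cudaStreamCreate(&stream_);
    cublasCreate(&blas_handle_);
    cublasSetStream(blas_handle_, stream_);
    cudnnCreate(&dnn_handle_);
    cudnnSetStream(dnn_handle_, stream_);
  }
  ~CUDAContext() {
    cudaStreamSynchronize(stream_);  // drain pending work first
    cudnnDestroy(dnn_handle_);       // handles created after the stream...
    cublasDestroy(blas_handle_);     // ...are destroyed before it
    cudaStreamDestroy(stream_);      // created first, destroyed last
  }

 private:
  cudaStream_t stream_ = nullptr;
  cublasHandle_t blas_handle_ = nullptr;
  cudnnHandle_t dnn_handle_ = nullptr;
};
```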

## Context Design


A Net is executed by single or several threads. A Context is related to a thread and records necessary runtime resources.
Contributor

records -> holds? sounds weird

Member Author

Done

Context is defined as follows:

```
class Context {};
```
Contributor

struct Context; may be better. Context is a lightweight data structure: users access its member data, and no rich member functions are needed, so a struct is enough.

See mxnet::Context for reference.

Member Author

The Context defined here is more like RunContext in mxnet.


### CUDAContext

Because the Tensor computation are executed by Eigen library, which needs an Eigen::GpuDevice type object as parameter. And the GpuDevice parameter is constructed with an Eigen::CudaStreamDevice object. We need to set a specific GpuID and CudaStream to create a Eigen::CudaStream object.
Contributor

-> which needs an Eigen::GpuDevice type object as a parameter

. And -> , and
. We -> , we

We need to set both a specific GpuID and a CudaStream to create an Eigen::CudaStream object.

Member Author

Done
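(For illustration, the construction chain described above, sketched against the Eigen unsupported Tensor API of that era; version details may differ:)

```c++
#define EIGEN_USE_GPU
#include <cuda_runtime.h>
#include <unsupported/Eigen/CXX11/Tensor>

// Sketch: GpuID + CudaStream -> Eigen::CudaStreamDevice -> Eigen::GpuDevice.
void MakeEigenDevice(int gpu_id) {
  cudaSetDevice(gpu_id);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  Eigen::CudaStreamDevice stream_device(&stream, gpu_id);
  Eigen::GpuDevice gpu_device(&stream_device);
  // gpu_device can now drive tensor expressions, e.g.
  //   out.device(gpu_device) = a + b;

  cudaStreamDestroy(stream);
}
```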


At the same time, some computation work will be executed by the cublas or cudnn library. Take the cublas library as an example: we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as the Eigen library does.

The future DAGNet is run by multi-threads. And each thread will have its own Eigen::GpuDevice object binding on different CudaStream. Multi-threads can run parallelly on a same GPU card.
Contributor

all ". And" -> ", and"

on different CudaStream. Multi-threads can run parallelly on a same GPU card.
->
on different CudaStream**, so that** multi-threads can run parallelly on the same GPU card.

Member Author

Done
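(For illustration, a minimal sketch of acquiring a cublasHandle bound to a stream, mirroring how the Eigen device is bound; the function name and BLAS call are illustrative:)

```c++
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: work launched through this handle is serialized on `stream`
// and can run concurrently with work on other streams.
void ScaleOnStream(float* dev_x, int n, float alpha, cudaStream_t stream) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSetStream(handle, stream);
  cublasSscal(handle, n, &alpha, dev_x, 1);  // enqueued on `stream`
  cublasDestroy(handle);
}
```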


- Different GPU cards have different GpuIDs, and we can do data parallelism on multi-GPUs.
- Multi-threads can run a Net parallelly on a single GPU card, and each thread has one Context.
- There is also single thread executing a Net sequentially. All computation and communication work will use same Context.
Contributor

use same -> use the same

Member Author

Done

Context is defined as follows:

```
class Context {};
```
Contributor

no common methods shared between GpuContext and CpuContext?

```c++
struct Context {
  int dev_id{0};  // CPU = 0
  enum DevType {
    kCPU,
    kGPU,
  };
  DevType dev_type;

  enum StreamType {
    kCUDNN,
    kBLAS,
    kCUDA,
  };
  // All the streams are created globally, so a void* is enough.
  // idx is the index of the stream of `type`; each thread in a device can have one idx.
  void* GetStream(StreamType type, int stream_idx);
  // Allocate one stream for each StreamType and thread, in all the devices.
  static void** streams;
};
```

Both CpuContext and GpuContext are Contexts, so it's weird that they have no similarity.

Could we merge them into one Context? Do we really need two?
If CpuContext and GpuContext will be used as template parameters, can Place replace their role?

Member Author

The Context is just for unifying the two classes CUDAContext and CPUContext. RunContext in mxnet similarly unifies the two classes Stream<gpu> and Stream<cpu>.

@QiJune QiJune requested a review from wangkuiyi June 29, 2017 08:50

At the same time, some computation work will be executed by the cuBLAS or cuDNN library. Take the cuBLAS library as an example: we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as the Eigen library does.

A `Context` is corresponded to a thread and holds runtime resources, such as CudaStream, cublasHandle and so on. And the `Run` method of `Operator` will take `Context` from a specific thread as a parameter to get necessary runtime resources.
Collaborator

Why must a Context correspond to a thread?

Member Author

The Context here is actually a thread context. The Net is executed by one or several threads, and each thread will have its own runtime resources. Maybe ThreadContext would be an easier name to understand.

Context is defined as follows (used just to unify class CUDAContext and class CPUContext):

```
class Context {};
```
Collaborator

So, how do we use CPUContext or GPUContext from a Context*?

If we want to use dynamic_cast<GPUContext*>(context);, Context should have a virtual destructor.

```c++
struct Context {
  virtual ~Context() {}
};
```

Member Author

Yes, you are right. I will fix it later.
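(For illustration, how a caller could then recover the concrete context; the GPUContext type here is a stand-in name for the discussion above:)

```c++
struct Context {
  virtual ~Context() {}  // makes Context polymorphic
};

struct GPUContext : public Context {
  // stream, cublas/cudnn handles, etc.
};

void RunOnContext(Context* ctx) {
  // dynamic_cast is only well-formed because Context has a virtual destructor.
  if (auto* gpu_ctx = dynamic_cast<GPUContext*>(ctx)) {
    // GPU path: use gpu_ctx's stream and handles.
  } else {
    // CPU path.
  }
}
```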

```c++
class DeviceGuard {
 public:
  explicit DeviceGuard(int newDevice)
```
Collaborator

Should DeviceGuard take Place as the argument? Like

```c++
class GPUPlaceGuard {
 public:
  explicit GPUPlaceGuard(GPUPlace place);
};
```

It could make our design consistent.

Member Author

Done
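(A sketch of what the Place-based guard might look like; the GPUPlace struct here is a simplified stand-in for Paddle's actual type:)

```c++
#include <cuda_runtime.h>

struct GPUPlace {
  int device = 0;  // simplified stand-in for the real Place type
};

// Sketch: RAII guard keyed on Place rather than a raw int, keeping the
// interface consistent with the rest of the design.
class GPUPlaceGuard {
 public:
  explicit GPUPlaceGuard(GPUPlace place) {
    cudaGetDevice(&previous_.device);
    cudaSetDevice(place.device);
  }
  ~GPUPlaceGuard() { cudaSetDevice(previous_.device); }

 private:
  GPUPlace previous_;
};
```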

@QiJune QiJune mentioned this pull request Jul 3, 2017
@QiJune QiJune changed the title add context design doc add DeviceContext design doc Jul 5, 2017
@QiJune QiJune mentioned this pull request Jul 5, 2017
Collaborator

@wangkuiyi wangkuiyi left a comment

Thanks for this PR.

I am wondering about one basic question -- could we restrict a net to run on a single device, and make use of multiple devices by introducing a class DataParallelEngine?

It seems that if we do so, we can simplify a lot of things, at least for right now?

`Net` is the container and controller of a set of `Operator`. Each `Operator` in `Net` has a method called `Run` to make computation. The `Run` method of `Operator` is defined as follows:

```c++
Error Operator::Run(OpContext* context);
```
Collaborator

What kind of errors could Operator::Run return? What actions could the caller take in response to these errors?

I ask because I always tend to handle errors inside the function instead of leaving errors to the caller, who has less information than the callee and usually cannot do much with the returned errors.

Member

The Run interface of Operator has changed and does not return an Error any more.

Member Author

Got it. I will fix it later.


At the same time, some computation work will be executed by the cuBLAS or cuDNN library. Take the cuBLAS library as an example: we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as the Eigen library does.

`DeviceContext` is defined as follows (used just to unify class `CudaDeviceContext` and class `CpuDeviceContext`):
Collaborator

If DeviceContext has multiple sub-classes, it implies that these sub-classes share some common behaviors (or methods). So it looks strange that the following definition of DeviceContext doesn't have virtual methods other than the destructor.

Member Author

Yes, we need a stream, cublasHandle, and cudnnHandle on GPU, but nearly nothing on CPU.
So CPUDeviceContext and CUDADeviceContext can hardly have anything in common.
The DeviceContext is just used to unify the two types and to make it convenient to pass a parameter to Operator.

In caffe2, except for memory New and Delete, CPUContext and CUDAContext have nearly no methods in common. (https://github.com/caffe2/caffe2/blob/master/caffe2/core/context.h#L124)
In mxnet, Stream<cpu> and Stream<gpu> have nearly no methods in common either. The Stream in mxnet holds runtime resources. (https://github.com/dmlc/mshadow/blob/20b54f068c1035f0319fa5e5bbfb129c450a5256/mshadow/tensor.h#L370)

```c++
}

~DeviceGuard() {
  cudaError_t err = cudaSetDevice(previous_.device);
```
Collaborator

It seems that we can write:

```c++
PADDLE_ENFORCE(cudaSetDevice(previous_.device));
```

Member Author

Yes, I have fixed it in PR #2709.
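(For illustration only, an ENFORCE-style macro for CUDA return codes might look roughly like this; this is not the actual PADDLE_ENFORCE definition:)

```c++
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative only: check a CUDA return code and abort with context on
// failure, so call sites stay one line.
#define CUDA_ENFORCE(expr)                                        \
  do {                                                            \
    cudaError_t err__ = (expr);                                   \
    if (err__ != cudaSuccess) {                                   \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err__));                    \
      std::abort();                                               \
    }                                                             \
  } while (0)

// Usage: CUDA_ENFORCE(cudaSetDevice(previous_.device));
```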

```c++
  eigen_handle_ = new Eigen::GpuDevice(eigen_stream_);
}

void Wait() {
```
Collaborator

This two-line function is only called by the destructor; how about moving these two lines into the destructor and saving this function?

Member Author

The Wait method synchronizes a CUDA stream, and it may also be called in other circumstances.
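(That is, in CUDA terms, a sketch of what Wait amounts to:)

```c++
#include <cuda_runtime.h>

// Sketch: block the calling CPU thread until all work previously
// enqueued on the stream has completed.
void Wait(cudaStream_t stream) {
  cudaStreamSynchronize(stream);
}
```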

```c++
  GPUPlace previous_;
};

class CudaDeviceContext : public DeviceContext {
```
Collaborator

Cuda => CUDA. It is an acronym.


### CpuDeviceContext

CpuDeviceContext is defined as follows:
Collaborator

Cpu => CPU

```c++
};
```

The `Run` method will take `Variable` (containing `Tensor`) from `Scope` to make computation on certain `DeviceContext`. `DeviceContext` provides necessary runtime resources for computation, including CudaStream, cublasHandle and so on.
Collaborator

Is it something like Operator::Run can call Operator::Input(i) to get the i-th input, which should have been created by the feeding operator or some other one before the invocation of Operator::Run?

If so, it seems that Operator::Run has to run on the same device as where the input resides; otherwise, there would be unnecessary data copying from where the input has been to where Run is on.

Given that we don't want such kind of inefficient copying, we should make sure that all operators of a network run on the same device, and Operator::Run should take a constant reference to the context, like

```c++
void Operator::Run(const Context& ctx);
```

and Context should refer to a single specific device (like a GPU) rather than a set of GPUs?

I think we can do so actually, just make class Net a simple lightweight class without sub-classes. If we want to make use of multiple GPUs, we can create class DataParallelEngine which runs multiple clones of a net, each on a GPU and by a thread, and aggregates gradient tensors of model parameters. A simpler engine could be class SingleThreadEngine which runs a net on a specific device using one thread.

Member Author

At first, you are right that Operator::Run will run on the same device.

However, a Context which binds to a specific GPU card is not enough for efficient Net execution. Operator::Run usually binds to a specific CUDA stream on a specific GPU card.

If we just want to implement a SimpleNet, in which all the operators execute sequentially, one CUDA stream is enough.

But if we also want to implement a DAGNet, in which the operators can execute in parallel, we have to create several CUDA streams on a GPU card.

So we may have a single thread or several threads executing a Net, and each thread will hold its own CUDA stream.

There are two ways to implement DAGNet.

  • We pass a Context parameter to the Operator, and the Context must have several CUDA streams.
```c++
struct Context {
  int device_id;
  std::vector<cudaStream_t> streams;
};

void Operator::Run(const Context& ctx);
```

And the Operator has to choose one CUDA stream to run on.

  • We pass an OpContext to the Operator:
```c++
struct OpContext {
  int device_id;
  cudaStream_t stream;
};

void Operator::Run(OpContext& ctx);
```

The OpContext will be passed to the Operator and may be created by many threads, and all the threads are managed and scheduled by the engine.
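(For illustration, a hypothetical worker body for the second option; the engine and scheduling details are assumptions:)

```c++
#include <cuda_runtime.h>

struct OpContext {
  int device_id;
  cudaStream_t stream;
};

// Hypothetical worker body: the engine creates one stream per worker
// thread and wraps it in an OpContext for every operator it schedules.
void WorkerLoop(int device_id /*, queue of scheduled operators */) {
  cudaSetDevice(device_id);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  OpContext ctx{device_id, stream};
  // for each scheduled op: op->Run(ctx);
  cudaStreamDestroy(stream);
}
```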

Member

There are two ways to get input when running an operator:

  1. caffe uses:

```c++
Operator::Input(i)
```

  2. tf uses:

```c++
void Compute(OpKernelContext* context) override {
  const Tensor& input = context->input(0);
  const Tensor& bias = context->input(1);
}
```

Collaborator

My point is exactly that we might not need DAGNet.

Member Author

So, we may implement a DataParallelEngine which runs on multi-GPUs. And I think that on a single GPU card we also have to implement both a SingleThreadEngine and a DAGEngine.

Since the DAGEngine will create many threads, and each will bind to a specific CUDA stream, I suggest we take OpContext as the parameter of the Operator::Run method.

Contributor

I am writing a simple engine according to mxnet::engine's design document in my spare time:
https://github.com/Superjom/NaiveEngine
It is nearly finished, and I am writing more UTs to make sure that it works at high throughput.

Currently, DebugEngine is an engine without a thread pool, and a MultiThreadEnginePooled with a thread pool of only one worker covers this case:

> A simpler engine could be class SingleThreadEngine which runs a net on a specific device using one thread.

The code is much less than the original mxnet's, and the logic is simpler.

Just adding a choice :-) @wangkuiyi @reyoung

Member Author

For now, we may not need DAGNet.

But the Run method is actually executed on a specific CUDA stream. I think passing a CUDA stream to the Operator is more flexible.

Collaborator

@wangkuiyi wangkuiyi Jul 5, 2017

I'd suggest we target SingleThreadEngine if further discussion would take more time than we can afford. Let's assume that a net runs on a single GPU/CPU with one thread only. We can keep this design for later upgrading to more complicated engines.



```c++
class DeviceGuard {
```
Collaborator

If we restrict that a net runs on a single device, and DataParallelEngine makes use of multiple GPUs, does that mean that device switching could be defined inside of DataParallelEngine, or even its sub-class MultiCUDAEngine?

Member Author

@QiJune QiJune Jul 5, 2017

DeviceGuard is just used to reduce the memory burden on developers: it ensures that we call CUDA APIs on the right GPU device.
Without it, the developer would have to switch to the right GPU device before calling certain CUDA APIs.
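(A usage sketch, reusing the GPUPlaceGuard sketched earlier in this thread:)

```c++
{
  GPUPlaceGuard guard(GPUPlace{1});  // switch to GPU 1 for this scope
  // ... CUDA calls here run against device 1 ...
}  // the guard's destructor restores the previously active device
```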

@QiJune
Member Author

QiJune commented Aug 2, 2017

Done.

@QiJune QiJune closed this Aug 2, 2017