
add DeviceContext design doc #2648

Closed
wants to merge 3 commits into from

Conversation

QiJune
Member

@QiJune QiJune commented Jun 28, 2017


```c++
~DeviceGuard() noexcept {
  cudaError_t err = cudaSetDevice(previous_);
  PADDLE_ASSERT(err == cudaSuccess);
```
Collaborator

This is obviously not noexcept.

Member Author

Done
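(For illustration: one way to reconcile the error check with `noexcept` is to log rather than assert in the destructor. A minimal sketch, not the PR's final code; the logging choice is an assumption:)

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: an RAII guard whose destructor checks the CUDA error but never
// throws, so marking it noexcept is actually safe.
class DeviceGuard {
 public:
  explicit DeviceGuard(int new_device) {
    cudaGetDevice(&previous_);
    cudaSetDevice(new_device);
  }
  ~DeviceGuard() noexcept {
    cudaError_t err = cudaSetDevice(previous_);
    if (err != cudaSuccess) {
      // Log instead of asserting/throwing inside a noexcept destructor.
      std::fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
    }
  }

 private:
  int previous_ = 0;
};
```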


The future DAGNet is run by multi-threads. And each thread will have its own Eigen::GpuDevice object binding on different CudaStream. Multi-threads can run parallelly on a same GPU card.

And Copy(Communication) work will be in charge of specific thread. The copy thread will only get CudaStream from corresponding Context.
Member

@jacquesqiao jacquesqiao Jun 28, 2017

I don't quite understand this line. Does it mean that
a specific thread is in charge of the Copy (Communication) work?
And what is Copy (Communication) work?

Member Author

Yes, a specific thread is in charge of the Copy (Communication) work.
It means the data copies between CPU and GPU, or between GPUs.
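(For illustration, a minimal sketch of such a copy task, assuming the copy thread receives a `cudaStream_t` from its Context; the function name is hypothetical:)

```c++
#include <cuda_runtime.h>

// Hypothetical copy task: the dedicated copy thread only needs the stream
// from its Context to issue asynchronous transfers between host and device.
void CopyHostToDevice(void* dst, const void* src, size_t bytes,
                      cudaStream_t stream) {
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
  // Work on other streams can proceed; block only when the result is needed:
  cudaStreamSynchronize(stream);
}
```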

## Context Design


A Net is executed by single or several threads. A Context is related to a thread and records necessary runtime resources.
Contributor

There is no explanation here of why we need a Context; the doc should point out the relationship between Context and Operator.

Member Author

Done

```c++
  return blas_handle_;
}

cudnnHandle_t GetDnnHandle() {
```
Contributor

Maybe GetCUDNNHandle is clearer.

Member Author

Done


```c++
~CUDAContext() {
  Wait();
  cudaError_t err = cudaStreamDestroy(stream_);
```
Contributor

Maybe the destruction of the stream should be placed at the end, because the stream is the first thing to be constructed.

Member Author

Done
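(For illustration, a sketch where resources are released in reverse order of creation; member names follow the snippets quoted in this thread, and the exact handle set is an assumption:)

```c++
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cudnn.h>

// Sketch: the stream is created first, so it is destroyed last.
class CUDAContext {
 public:
  CUDAContext() {
    cudaStreamCreate(&stream_);
    cublasCreate(&blas_handle_);
    cublasSetStream(blas_handle_, stream_);
    cudnnCreate(&dnn_handle_);
    cudnnSetStream(dnn_handle_, stream_);
  }
  ~CUDAContext() {
    cudaStreamSynchronize(stream_);  // drain pending work first
    cudnnDestroy(dnn_handle_);       // handles created after the stream...
    cublasDestroy(blas_handle_);     // ...are destroyed before it
    cudaStreamDestroy(stream_);      // created first, destroyed last
  }

 private:
  cudaStream_t stream_ = nullptr;
  cublasHandle_t blas_handle_ = nullptr;
  cudnnHandle_t dnn_handle_ = nullptr;
};
```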

## Context Design


A Net is executed by single or several threads. A Context is related to a thread and records necessary runtime resources.
Contributor

records -> holds? sounds weird

Member Author

Done

Context is defined as follows:

```
class Context {};
```
Contributor

struct Context; may be better. Context is a lightweight data structure: users access its member data, and no rich member functions are needed, so a struct is enough.

See mxnet::Context for reference.

Member Author

The Context defined here is more like RunContext in mxnet.


### CUDAContext

Because the Tensor computation are executed by Eigen library, which needs an Eigen::GpuDevice type object as parameter. And the GpuDevice parameter is constructed with an Eigen::CudaStreamDevice object. We need to set a specific GpuID and CudaStream to create a Eigen::CudaStream object.
Contributor

-> which needs an Eigen::GpuDevice type object as a parameter

. And -> , and
. We -> , we

We need to set both a specific GpuID and a CudaStream to create an Eigen::CudaStream object.

Member Author

Done
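(For illustration, the construction chain described above, sketched against the Eigen unsupported Tensor API of that era; version details may differ:)

```c++
#define EIGEN_USE_GPU
#include <cuda_runtime.h>
#include <unsupported/Eigen/CXX11/Tensor>

// Sketch: GpuID + CudaStream -> Eigen::CudaStreamDevice -> Eigen::GpuDevice.
void MakeEigenDevice(int gpu_id) {
  cudaSetDevice(gpu_id);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  Eigen::CudaStreamDevice stream_device(&stream, gpu_id);
  Eigen::GpuDevice gpu_device(&stream_device);
  // gpu_device can now drive tensor expressions, e.g.
  //   out.device(gpu_device) = a + b;

  cudaStreamDestroy(stream);
}
```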


At the same time, some computation work will be executed by the cublas or cudnn library. Take the cublas library as an example: we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as the Eigen library does.

The future DAGNet is run by multi-threads. And each thread will have its own Eigen::GpuDevice object binding on different CudaStream. Multi-threads can run parallelly on a same GPU card.
Contributor

all ". And" -> ", and"

on different CudaStream. Multi-threads can run parallelly on a same GPU card.
->
on different CudaStream**, so that** multi-threads can run parallelly on the same GPU card.

Member Author

Done
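(For illustration, a minimal sketch of acquiring a cublasHandle bound to a stream, mirroring how the Eigen device is bound; the function name and BLAS call are illustrative:)

```c++
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: work launched through this handle is serialized on `stream`
// and can run concurrently with work on other streams.
void ScaleOnStream(float* dev_x, int n, float alpha, cudaStream_t stream) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasSetStream(handle, stream);
  cublasSscal(handle, n, &alpha, dev_x, 1);  // enqueued on `stream`
  cublasDestroy(handle);
}
```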


- Different GPU cards have different GpuIDs, and we can do data parallelism on multi-GPUs.
- Multi-threads can run a Net parallelly on a single GPU card, and each thread has one Context.
- There is also single thread executing a Net sequentially. All computation and communication work will use same Context.
Contributor

use same -> use the same

Member Author

Done

Context is defined as follows:

```
class Context {};
```
Contributor

no common methods shared between GpuContext and CpuContext?

```c++
struct Context {
  int dev_id{0};  // CPU = 0
  enum DevType {
    kCPU,
    kGPU,
  };
  DevType dev_type;

  enum StreamType {
    kCUDNN,
    kBLAS,
    kCUDA,
  };
  // All the streams are created globally, so a void* is enough.
  // idx is the index of the stream of `type`; each thread in a device can have one idx.
  void* GetStream(StreamType type, int stream_idx);
  // Allocate one stream for each StreamType and thread, in all the devices.
  static void** streams;
};
```

Both CpuContext and GpuContext are Contexts, so it's weird that they have no similarity.

Could we merge them into one Context? Do we really need two?
If CpuContext and GpuContext will be used as template parameters, can Place replace their role?

Member Author

The Context is just for unifying the two classes CUDAContext and CPUContext. RunContext in mxnet similarly unifies the two classes Stream<gpu> and Stream<cpu>.

@QiJune QiJune requested a review from wangkuiyi June 29, 2017 08:50

At the same time, some computation work will be executed by the cuBLAS or cuDNN library. Take the cuBLAS library as an example: we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as the Eigen library does.

A `Context` is corresponded to a thread and holds runtime resources, such as CudaStream, cublasHandle and so on. And the `Run` method of `Operator` will take `Context` from a specific thread as a parameter to get necessary runtime resources.
Collaborator

Why must a Context correspond to a thread?

Member Author

The Context here is actually a thread context. The Net is executed by one or several threads, and each thread will have its own runtime resources. Maybe ThreadContext would be an easier name to understand.

Context is defined as follows (used just to unify class CUDAContext and class CPUContext):

```
class Context {};
```
Collaborator

So, how do we use CPUContext or GPUContext from a Context*?

If we want to use dynamic_cast<GPUContext*>(context);, Context should have a virtual destructor.

```c++
struct Context {
  virtual ~Context() {}
};
```

Member Author

Yes, you are right. I will fix it later.
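(For illustration, how a caller could then recover the concrete context; the GPUContext type here is a stand-in name for the discussion above:)

```c++
struct Context {
  virtual ~Context() {}  // makes Context polymorphic
};

struct GPUContext : public Context {
  // stream, cublas/cudnn handles, etc.
};

void RunOnContext(Context* ctx) {
  // dynamic_cast is only well-formed because Context has a virtual destructor.
  if (auto* gpu_ctx = dynamic_cast<GPUContext*>(ctx)) {
    // GPU path: use gpu_ctx's stream and handles.
  } else {
    // CPU path.
  }
}
```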

```c++
class DeviceGuard {
 public:
  explicit DeviceGuard(int newDevice)
```
Collaborator

Should DeviceGuard take Place as the argument? Like

```c++
class GPUPlaceGuard {
 public:
  explicit GPUPlaceGuard(GPUPlace place);
};
```

It could make our design consistent.

Member Author

Done
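(A sketch of what the Place-based guard might look like; the GPUPlace struct here is a simplified stand-in for Paddle's actual type:)

```c++
#include <cuda_runtime.h>

struct GPUPlace {
  int device = 0;  // simplified stand-in for the real Place type
};

// Sketch: RAII guard keyed on Place rather than a raw int, keeping the
// interface consistent with the rest of the design.
class GPUPlaceGuard {
 public:
  explicit GPUPlaceGuard(GPUPlace place) {
    cudaGetDevice(&previous_.device);
    cudaSetDevice(place.device);
  }
  ~GPUPlaceGuard() { cudaSetDevice(previous_.device); }

 private:
  GPUPlace previous_;
};
```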

@QiJune QiJune mentioned this pull request Jul 3, 2017
@QiJune QiJune changed the title add context design doc add DeviceContext design doc Jul 5, 2017
@QiJune QiJune mentioned this pull request Jul 5, 2017
Collaborator

@wangkuiyi wangkuiyi left a comment

Thanks for this PR.

I am wondering about one basic question -- could we restrict a net to run on a single device, and make use of multiple devices by introducing a class DataParallelEngine?

It seems that if we do so, we can simplify a lot of things, at least for right now?

`Net` is the container and controller of a set of `Operator`. Each `Operator` in `Net` has a method called `Run` to make computation. The `Run` method of `Operator` is defined as follows:

```c++
Error Operator::Run(OpContext* context);
```
Collaborator

What kind of errors could Operator::Run return? What actions could the caller take in response to these errors?

I ask because I always tend to handle errors inside the function instead of leaving errors to the caller, who has less information than the callee and usually cannot do much with the returned errors.

Member

The Run interface of Operator has changed and does not return an Error any more.

Member Author

Got it. I will fix it later.


At the same time, some computation work will be executed by the cuBLAS or cuDNN library. Take the cuBLAS library as an example: we have to acquire a cublasHandle which binds on a CudaStream to make computation. It's the same way as the Eigen library does.

`DeviceContext` is defined as follows (used just to unify class `CudaDeviceContext` and class `CpuDeviceContext`):
Collaborator

If DeviceContext has multiple sub-classes, it implies that these sub-classes share some common behaviors (or methods). So it looks strange that the following definition of DeviceContext doesn't have virtual methods other than the destructor.

Member Author

Yes, we need a stream, cublasHandle, and cudnnHandle on GPU, but nearly nothing on CPU.
So CPUDeviceContext and CUDADeviceContext can hardly have anything in common.
The DeviceContext is just used to unify the two types and to make it convenient to pass a parameter to Operator.

In caffe2, except for memory New and Delete, CPUContext and CUDAContext have nearly no methods in common. (https://github.com/caffe2/caffe2/blob/master/caffe2/core/context.h#L124)
In mxnet, Stream<cpu> and Stream<gpu> have nearly no methods in common either. The Stream in mxnet holds runtime resources. (https://github.com/dmlc/mshadow/blob/20b54f068c1035f0319fa5e5bbfb129c450a5256/mshadow/tensor.h#L370)

```c++
}

~DeviceGuard() {
  cudaError_t err = cudaSetDevice(previous_.device);
```
Collaborator

It seems that we can write:

```c++
PADDLE_ENFORCE(cudaSetDevice(previous_.device));
```

Member Author

Yes, I have fixed it in PR #2709.
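(For illustration only, an ENFORCE-style macro for CUDA return codes might look roughly like this; this is not the actual PADDLE_ENFORCE definition:)

```c++
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative only: check a CUDA return code and abort with context on
// failure, so call sites stay one line.
#define CUDA_ENFORCE(expr)                                        \
  do {                                                            \
    cudaError_t err__ = (expr);                                   \
    if (err__ != cudaSuccess) {                                   \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err__));                    \
      std::abort();                                               \
    }                                                             \
  } while (0)

// Usage: CUDA_ENFORCE(cudaSetDevice(previous_.device));
```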

```c++
  eigen_handle_ = new Eigen::GpuDevice(eigen_stream_);
}

void Wait() {
```
Collaborator

This two-line function is only called by the destructor; how about moving these two lines into the destructor and saving this function?

Member Author

The Wait method synchronizes a CUDA stream, and it may also be called in other circumstances.
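(That is, in CUDA terms, a sketch of what Wait amounts to:)

```c++
#include <cuda_runtime.h>

// Sketch: block the calling CPU thread until all work previously
// enqueued on the stream has completed.
void Wait(cudaStream_t stream) {
  cudaStreamSynchronize(stream);
}
```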

```c++
  GPUPlace previous_;
};

class CudaDeviceContext : public DeviceContext {
```
Collaborator

Cuda => CUDA. It is an acronym.


### CpuDeviceContext

CpuDeviceContext is defined as follows:
Collaborator

Cpu => CPU

```c++
};
```

The `Run` method will take `Variable` (containing `Tensor`) from `Scope` to make computation on certain `DeviceContext`. `DeviceContext` provides necessary runtime resources for computation, including CudaStream, cublasHandle and so on.
Collaborator

Is it something like Operator::Run can call Operator::Input(i) to get the i-th input, which should have been created by the feeding operator or some other one before the invocation of Operator::Run?

If so, it seems that Operator::Run has to run on the same device as where the input resides; otherwise, there would be unnecessary data copying from where the input has been to where Run is on.

Given that we don't want such kind of inefficient copying, we should make sure that all operators of a network run on the same device, and Operator::Run should take a constant reference to the context, like

```c++
void Operator::Run(const Context& ctx);
```

and Context should refer to a single specific device (like a GPU) rather than a set of GPUs?

I think we can do so actually, just make class Net a simple lightweight class without sub-classes. If we want to make use of multiple GPUs, we can create class DataParallelEngine which runs multiple clones of a net, each on a GPU and by a thread, and aggregates gradient tensors of model parameters. A simpler engine could be class SingleThreadEngine which runs a net on a specific device using one thread.

Member Author

At first, you are right that Operator::Run will run on the same device.

However, a Context which binds to a specific GPU card is not enough for efficient Net execution. Operator::Run usually binds to a specific CUDA stream on a specific GPU card.

If we just want to implement a SimpleNet, in which all the operators execute sequentially, one CUDA stream is enough.

But if we also want to implement a DAGNet, in which the operators can execute in parallel, we have to create several CUDA streams on a GPU card.

So we may have a single thread or several threads executing a Net, and each thread will hold its own CUDA stream.

There are two ways to implement DAGNet.

  • We pass a Context parameter to the Operator, and the Context must have several CUDA streams.
```c++
struct Context {
  int device_id;
  std::vector<cudaStream_t> streams;
};

void Operator::Run(const Context& ctx);
```

And the Operator has to choose one CUDA stream to run on.

  • We pass an OpContext to the Operator:
```c++
struct OpContext {
  int device_id;
  cudaStream_t stream;
};

void Operator::Run(OpContext& ctx);
```

The OpContext will be passed to the Operator and may be created by many threads, and all the threads are managed and scheduled by the engine.
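(For illustration, a hypothetical worker body for the second option; the engine and scheduling details are assumptions:)

```c++
#include <cuda_runtime.h>

struct OpContext {
  int device_id;
  cudaStream_t stream;
};

// Hypothetical worker body: the engine creates one stream per worker
// thread and wraps it in an OpContext for every operator it schedules.
void WorkerLoop(int device_id /*, queue of scheduled operators */) {
  cudaSetDevice(device_id);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  OpContext ctx{device_id, stream};
  // for each scheduled op: op->Run(ctx);
  cudaStreamDestroy(stream);
}
```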

Member

There are two ways to get input when running an operator:

  1. caffe uses:

```c++
Operator::Input(i)
```

  2. tf uses:

```c++
void Compute(OpKernelContext* context) override {
  const Tensor& input = context->input(0);
  const Tensor& bias = context->input(1);
}
```

Collaborator

My point is exactly that we might not need DAGNet.

Member Author

So, we may implement a DataParallelEngine which runs on multi-GPUs. And I think that on a single GPU card we also have to implement both a SingleThreadEngine and a DAGEngine.

Since the DAGEngine will create many threads, and each will bind to a specific CUDA stream, I suggest we take OpContext as the parameter of the Operator::Run method.

Contributor

I am writing a simple engine according to mxnet::engine's design document in my spare time:
https://github.com/Superjom/NaiveEngine
It is nearly finished, and I am writing more UTs to make sure that it works at high throughput.

Currently, DebugEngine is an engine without a thread pool, and a MultiThreadEnginePooled with a thread pool of only one worker covers this case:

> A simpler engine could be class SingleThreadEngine which runs a net on a specific device using one thread.

The code is much less than the original mxnet's, and the logic is simpler.

Just adding a choice :-) @wangkuiyi @reyoung

Member Author

For now, we may not need DAGNet.

But the Run method is actually executed on a specific CUDA stream. I think passing a CUDA stream to the Operator is more flexible.

Collaborator

@wangkuiyi wangkuiyi Jul 5, 2017

I'd suggest we target SingleThreadEngine if further discussion would take more time than we can afford. Let's assume that a net runs on a single GPU/CPU with one thread only. We can keep this design for later upgrading to more complicated engines.



```c++
class DeviceGuard {
```
Collaborator

If we restrict that a net runs on a single device, and DataParallelEngine makes use of multiple GPUs, does that mean that device switching could be defined inside of DataParallelEngine, or even its sub-class MultiCUDAEngine?

Member Author

@QiJune QiJune Jul 5, 2017

DeviceGuard is just used to reduce the memory burden on developers: it ensures that we call CUDA APIs on the right GPU device.
Without it, the developer would have to switch to the right GPU device before calling certain CUDA APIs.
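(A usage sketch, reusing the GPUPlaceGuard sketched earlier in this thread:)

```c++
{
  GPUPlaceGuard guard(GPUPlace{1});  // switch to GPU 1 for this scope
  // ... CUDA calls here run against device 1 ...
}  // the guard's destructor restores the previously active device
```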

@QiJune
Member Author

QiJune commented Aug 2, 2017

Done.

@QiJune QiJune closed this Aug 2, 2017