A very preliminary draft #2696
205 changes: 205 additions & 0 deletions paddle/framework/framework.md
# Design Doc: Framework

This design draws on the study of other deep learning systems and numerous group discussions.

## Operation and Operator

At the core of a deep learning system is the neural network. A neural network is a directed graph of operations, where each operation is an instance of an *operator*: a C++ class that derives from the base class `Operator` and implements a `Run` method.
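For illustration, a minimal sketch of this relationship, assuming only what this paragraph states; the `GemmOp` name and the empty method bodies are placeholders, not the actual Paddle classes:

```cpp
// Minimal sketch of the operator/operation relationship described above.
class Operator {
 public:
  virtual ~Operator() {}
  virtual void Run() const = 0;  // each concrete operator implements its computation here
};

// A hypothetical concrete operator; every "operation" node in the graph
// is an instance of such a class.
class GemmOp : public Operator {
 public:
  void Run() const override { /* compute scale * X * W and the activation */ }
};
```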

### `OperationProto`

Users are supposed to describe neural networks by calling Python functions known as *operation creators*, which in turn call a C++ function, `paddle::framework::CreateOperation`. Funneling all creations through this single C++ function eases the addition of more language bindings.

Python and C++ have different function call syntaxes, so we define the parameter of `paddle::framework::CreateOperation` as a protobuf message, `OperationProto`.
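A rough sketch of what this single entry point might look like; the exact signature and the `Operation` type are assumptions based on this section, not the real API:

```cpp
// Hypothetical sketch: every language binding builds an OperationProto and
// calls this one function, so adding a binding needs no new C++ entry points.
namespace paddle {
namespace framework {

class Operation;       // an operator instance in the graph (assumed type)
class OperationProto;  // the protobuf message describing one operation

Operation* CreateOperation(const OperationProto& proto);

}  // namespace framework
}  // namespace paddle
```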

### `OperatorProto`

We'd like to generate operation creators automatically from the C++ code so that the Python code stays up to date. To do so, we describe each C++ operator class in a protobuf message `OperatorProto`, fill in an `OperatorProto` message for each operator class, and expose these messages to the Python function `paddle.framework.create_operation_creators`. We call this filling and exposing mechanism *operator registration*.
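For illustration only, registration could be a global registry filled by a macro at static-initialization time; `OpRegistry`, `REGISTER_OP`, and `FillProto` below are assumed names, not the actual mechanism:

```cpp
// Hypothetical registration sketch.
#include <map>
#include <string>

struct OperatorProto {  // stands in for the real protobuf message
  std::string comment;
};

class OpRegistry {
 public:
  // Maps an operator name to its description; Python reads this map through
  // paddle.framework.create_operation_creators.
  static std::map<std::string, OperatorProto>& Protos() {
    static std::map<std::string, OperatorProto> protos;
    return protos;
  }
};

// Fills and stores an OperatorProto for op_class under the name op_name.
#define REGISTER_OP(op_class, op_name)                       \
  static bool op_name##_registered = [] {                    \
    OpRegistry::Protos()[#op_name] = op_class::FillProto();  \
    return true;                                             \
  }()

// Usage (assuming GemmOp provides a static FillProto() describing itself):
// REGISTER_OP(GemmOp, gemm);
```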


## Operators, Layers, Variables, Scope

### Operators as Functions

An operator is intrinsically a function with inputs and outputs. For example, the functional representation of the GEMM operator is:
Collaborator: Maybe `InferShape` is also needed?


```cpp
// Functional pseudocode, not actual C++.
gemm(X, W, scale, act=ReLU) {
  unactivated = scale * X * W
  if (!act) {
    return unactivated
  }
  return unactivated, act(unactivated, cap)
}
```

Note that operators might call other operators. In the above example, `gemm` calls `act`.
Collaborator: Maybe a better example of an Op calling other Ops is RnnOp. :-) A normal gemm doesn't contain an activation.


### Gradient Operators

Each operator has a corresponding gradient operator that defines the gradient computation.
Collaborator: One forward op corresponds not only to one gradient operator but possibly to several gradient operators. For example, FcOp has three inputs, and not all of them need gradients in practice. Each input could correspond to one gradient operator, so FcOp could have three gradient ops. More generally, Caffe2 and MXNet return an array of ops as one op's gradient ops.


### Layers

If we described a neural network by calling operation creators directly, our code would be lengthy. It is easier to use *layer creators*, which, in addition to calling operation creators for the computation, create model parameters.

For example, `paddle.layer.fc` and `paddle.layer.conv` both call `paddle.op.gemm`. They should also create and initialize `W`, the layer's parameter, and `act` and `scale`, the attributes.

### Variables

We prefer to represent inputs, outputs, and attributes as Variables, so that an operation's output can be used both to feed and to configure other operations.

Variables should be able to hold the following types of values:

- int
- bool
- float
- half
- string
- tensor
- operator
- scope

Some Python API proposals:

```python
x = paddle.variable.new("x") # create a variable of a not-yet-known type
x = paddle.variable.tensor("x") # create a tensor-typed variable without a value
x = paddle.variable.int("x", 10) # create an unnamed int variable and set it to 10
x = paddle.variable.operation("x", paddle.operator.gemm(...)) # an operation
x = paddle.variable.tensor("x",
                           numpy.random.randn(200, 100), # set the value of the tensor
                           estimated = True) # will be updated by the backward algorithm.
x.estimated = False # prevents x from being updated.
```

Collaborator: Could these APIs be unified by a single `paddle.variable.new`?

    x_var = paddle.variable.new("x", 10)

    y = paddle.variable.Tensor()
    y_var = paddle.variable.new("y", y)

    z = paddle.variable.new("z", "A string")

    no_name = paddle.variable.new(name=None, val=100)  # actually I think each variable should have a name; there might be some independent scope to hold them

Collaborator (@reyoung, Jul 5, 2017): Is that variable unnamed, or is it a typo? Maybe `x = paddle.variable.int(10)`? I cannot see why it is necessary for a variable to be unnamed.

Collaborator (@reyoung, Jul 5, 2017): I think maybe an independent Scope could represent all attribute variables of an operator, like I proposed here.

Collaborator (author): I don't see that attributes are so special that they cannot be represented by a Variable.

Collaborator: Also, why can a variable be an operation? Is that needed for RnnOp?

Collaborator (@reyoung, Jul 5, 2017): `estimated` is not a very straightforward name. Maybe `need_backward` or `requires_grad` would be better? Not all variables that need a gradient will be updated; only parameters are.

Note that a variable has the following methods:

```cpp
class Variable {
 public:
  bool Estimated() const;
  bool SetEstimated(bool);

  template <typename T> const T& Get() const;
  template <typename T> T* GetMutable();
  template <typename T> bool IsType(T) const;
};
```

Member: What is the meaning of the `bool` returned by `SetEstimated`?

Collaborator (author): I think we can make it just `void`.

`Get` and `GetMutable` implement *lazy memory allocation*, as described in the [Variable design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.md).
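For illustration, a minimal usage sketch of these accessors, assuming a `Tensor` value type as in the Variable design doc:

```cpp
// Illustrative only; exercises the accessors declared above.
Variable v;
Tensor* t = v.GetMutable<Tensor>();  // first access creates the Tensor lazily
const Tensor& r = v.Get<Tensor>();   // read-only access to the same, already-created Tensor
```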
Collaborator: Maybe only `GetMutable` implements lazy memory allocation; `Get` should raise an error if the variable has not been created?


Note that `name` is not a property of a variable. A variable can have various names in different scopes.

### Scope

In programming languages, variables belong to scopes. Scopes enable the release of local variables when a stack frame pops.

A neural network is arguably equivalent to a program. The *recurrent operator*, `paddle::operator::Recurrent`, is like a `for` loop, and the conditional operator, `paddle::operator::Conditional`, is like `if`/`switch`. They can have sub-networks as attributes, and `paddle::operator::Recurrent/Conditional::Run` creates a local scope before executing the sub-network. At inference time, `paddle::operator::Recurrent/Conditional::Run` frees the local scope before it completes. At training time, it is the corresponding gradient operators, `paddle::operator::RecurrentGrad/ConditionalGrad::Run`, that free the local scope.

The global and nested local scopes form a hierarchy. The following Python functions make it convenient to program scopes:

1. `paddle.scope.current()` returns the current scope, which defaults to the top-level scope returned by `paddle.scope.global()`.

   Member: Is this line ("which defaults to") incomplete?

   Collaborator (author): I meant that it defaults to the value returned by the next (second) item. I will rewrite it to make it clear.

1. `paddle.scope.global()`, which returns the top-level scope.

C++ code shouldn't maintain global state, such as the current scope, so as to prevent unexpected inconsistencies.

```cpp
class Scope {
 public:
  Scope() : parent_(nullptr) {}  // The constructor creates only global scopes.

  Variable* FindVar(std::string name);               // Finds in the hierarchy or returns null.
  Variable* CreateVar(StringPiece name, Variable*);  // Finds or creates.

  Scope* CreateScope(StringPiece name);  // Finds or creates a sub-scope.
  void DeleteScope(Scope*);              // Deletes a sub-scope or raises an exception.

 private:
  std::map<std::string /*name*/, std::unique_ptr<Variable>> vars_;
  std::shared_ptr<Scope> parent_;
  std::vector<Scope*> children_;

  Mutex mutex_;  // Makes this class thread-safe.
};
```

Member: Why do we need `children_` in a scope?

Collaborator (author): Debug printing is the only usage in my mind.

Collaborator: It seems reasonable; maybe I should update the Scope design and add a mutex today.
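For illustration, the `Scope` interface sketched above might be exercised as follows; the names are placeholders:

```cpp
// Illustrative only; uses the Scope interface declared above.
Scope global;                                 // a top-level scope (parent_ == nullptr)
Scope* step = global.CreateScope("step0");    // a sub-scope, e.g. for one RNN time step
Variable* x = step->CreateVar("x", nullptr);  // finds or creates "x" in the sub-scope
step->FindVar("x");                           // found: lookup searches this scope, then its parents
global.FindVar("x");                          // nullptr: a parent does not see its children's variables
global.DeleteScope(step);                     // releases the step-local variables
```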

## Execution and Context

A neural network is a program; training or inference means executing it. The runtime environment of the execution is known as a *context*, which consists of:

1. a scope,
1. device(s), or places,
   Collaborator: Maybe the device alone is not enough for a GPU. There could be a DeviceContext for each GPU, which holds:
   - the computation stream
   - handles for cuDNN/cuBLAS, etc.

1. a flag indicating whether we are training, i.e., whether we should create gradient operations and run the backward pass,

and can be defined as

```cpp
struct Context {
  Scope* scope_;
  std::vector<Place> places_;  // A network might run on multiple devices.
  bool training_;
};
```
Member (@QiJune, Jul 5, 2017): I think that the context of an Operator's `Run` method is different from that of a Net's `Run` method. The Net can run on multiple devices, but an operator can only run on a specific device. So the operator may need an `OpContext` to run.

Collaborator (author, @wangkuiyi, Jul 5, 2017): To be frank, I am not even sure that a net should be able to run on multiple devices. It seems that we can use multiple devices by doing data parallelism: each device runs one copy of the net, and we can do gradient aggregation using NCCL. In this way, it seems that both the net and the operators need just one device.

Collaborator: Regarding "it seems that both net and operators need just one device": is there a higher-level concept for running NCCL in C++, or do we just let Python run NCCL? If that concept is in C++, it seems that it is also a Network:

    class MultiDeviceNetwork {
     private:
      // holds networks on each device.
      vector<Network> networks_;
    };

However, I suggest that we should only be concerned with a single device in the basic Network. It is easy to turn a single-device Network into a multi-device Network by using NCCL.

Collaborator (author): Up to what I understand now, I'd avoid running a network on multiple devices, because maintaining multiple CUDA streams means a very complicated CUDAContext or OpKernelContext or something like that. And it seems that we can make use of multiple devices using data parallelism, which requires running a net on only a single device.


The Python API `paddle.train` can prepare a context before it calls C++ code `Operator::Run(const Context&)`.
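For illustration, the C++ side of that hand-off might look roughly as follows; `op` and the conversion from `GPUPlace` to `Place` are assumptions, not part of this design:

```cpp
// Illustrative only: assembling a Context before running an operator.
Scope global_scope;                         // the top-level scope
std::vector<Place> places = {GPUPlace{0}};  // GPUPlace/Place are defined in the "Place" section below;
                                            // the conversion to Place is assumed here
Context ctx{&global_scope, places, /*training_=*/true};
op->Run(ctx);                               // 'op' is a hypothetical Operator* built from the network description
```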

As an example, `paddle::operator::Recurrent::Run(const Context& ctx)` can then create a new scope by calling `ctx.scope_->CreateScope`, and run the step-net with a new context around the new scope:

```cpp
class Recurrent {
 public:
  void Run(const Context& ctx) {
    auto* subscope = ctx.scope_->CreateScope("");
    step_net_.Run(Context{subscope, ctx.places_, ctx.training_});
    if (!ctx.training_) {
      ctx.scope_->DeleteScope(subscope);
    }
  }
};
```

Another example is that the Gemm operator needs to create a tensor on `Context::places_[0]` and assign the tensor to its output variable:
Member: What about data parallelism on multiple GPUs? I think that `places_[0]` should not be used inside an operator's `Run` method.

Contributor: Anyway, we must specify one place to aggregate data from multiple GPUs.

Collaborator (author): @reyoung suggested defining the aggregation as an operator. I am not sure. It seems much easier if a net runs on a single device, as in #2696 (comment).

Collaborator (author): I agree with @QiJune on #2696 (comment): using data parallelism and only `places_[0]` would be enough.


```cpp
class Gemm {
 public:
  void Run(const Context& ctx) {
    if (paddle::platform::IsGPUPlace(ctx.places_[0])) {
      cuDNNGemm(
          Output(0).mutable_data<float>(ctx.places_[0], DerivedSizeFromInputs()),
          ...);
    } else {
      mkl::sgemm(
          Output(0).mutable_data<float>(ctx.places_[0], DerivedSizeFromInputs()),
          ...);
    }
  }
};
```

Member (@QiJune, Jul 5, 2017): Here it should be `cublasGemm`. And `cublasGemm` needs to acquire a cuBLAS handle to finish the computation. Nearly all the computation in the `Run` method needs to acquire an Eigen device. Please refer to #2648.

Collaborator (author): Good point. I am reading and reviewing your design.

### Place

A place indicates a device and its type. We have the following place definitions:

```cpp
struct GPUPlace {
  int device_;  // GPU id.
};

struct CPUPlace {
  enum Type {
    X86,
    ARM5,
    ARM6,
    ...
  };
  Type type_;
};
```

We can add more Place implementations, like FPGAPlace and XeonPhiPlace, in the future.
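For illustration, `Context::places_` could hold either kind of place if `Place` is a sum type over these structs; the use of `boost::variant` and the `IsGPUPlace` helper below are assumptions, not part of this design:

```cpp
// Illustrative only: Place as a tagged union over the structs above.
#include <boost/variant.hpp>

typedef boost::variant<GPUPlace, CPUPlace> Place;

// Hypothetical helper: does this place refer to a GPU?
inline bool IsGPUPlace(const Place& place) {
  return place.which() == 0;  // GPUPlace is the first alternative of the variant
}
```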

### Gradient Operators

A gradient operator should be built and linked only if we are building a binary that supports training. If we are building an "inference-only" binary, we shouldn't link gradient operators.
Collaborator: A gradient operator could also be a forward operator. For example, operator `mul`'s gradient operator is also `mul`.


Gradient operations should be created only if we are going to train a neural network.
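For illustration, both points could be realized by compiling and registering gradient operators only under a training build flag; `PADDLE_WITH_TRAINING` and `REGISTER_GRADIENT_OP` below are assumed names, not the actual flag or macro:

```cpp
// Illustrative only: gradient operators exist only in training builds.
#ifdef PADDLE_WITH_TRAINING

class GemmGradOp : public Operator {
 public:
  void Run() const override { /* compute dX and dW from the output gradient */ }
};

REGISTER_GRADIENT_OP(GemmGradOp, gemm_grad);  // hypothetical registration macro

#endif  // PADDLE_WITH_TRAINING
```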