A very preliminary draft #2696

# Design Doc: Framework

This design follows the study of other deep learning systems and numerous group discussions.

## Operation and Operator

At the core of a deep learning system is the neural network. A neural network is a directed graph of operations, where each operation is an instance of an *operator*: a C++ class derived from the base class `Operator` that implements a `Run` method.

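As a rough, non-normative sketch of that base class: only the class name `Operator` and the `Run` method come from this doc; the other members, and the `Context` parameter defined later in "Execution and Context", are assumptions for illustration.

```cpp
#include <string>
#include <vector>

struct Context;  // the runtime environment; defined in "Execution and Context" below.

// Minimal sketch of the operator base class.  Only `Run` is mandated by the
// text above; the input/output name lists are illustrative assumptions.
class Operator {
 public:
  virtual ~Operator() = default;
  virtual void Run(const Context& ctx) = 0;

 protected:
  std::vector<std::string> inputs_;   // names of input variables in the scope
  std::vector<std::string> outputs_;  // names of output variables in the scope
};
```
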
### `OperationProto`

Users are supposed to describe neural networks by calling Python functions known as *operation creators*, which in turn call a C++ function, `paddle::framework::CreateOperation`. This C++ function eases the addition of more language bindings.

Python and C++ have different function call syntaxes, so we define the parameter of `paddle::framework::CreateOperation` as a protobuf message, `OperationProto`.

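A minimal sketch of what this entry point could look like; the exact signature and return type are assumptions, not specified by this doc:

```cpp
#include <memory>

namespace paddle {
namespace framework {

class OperationProto;  // the protobuf message describing one operation call
class Operator;        // the operator base class

// Hypothetical signature: every language binding builds an OperationProto
// and hands it to this single C++ entry point, which instantiates the
// corresponding operator.
std::unique_ptr<Operator> CreateOperation(const OperationProto& proto);

}  // namespace framework
}  // namespace paddle
```
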
### `OperatorProto`

We'd like to generate operation creators automatically from C++ code so as to keep the Python code always up to date. To do so, we need to describe each C++ operator class in a protobuf message, `OperatorProto`. We also need to fill in an `OperatorProto` message for each operator class and expose these messages to the Python function `paddle.framework.create_operation_creators`. We call this filling and exposing mechanism *operator registration*.

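One way the registration mechanism could be wired up, sketched under the assumption of a macro-filled global registry; the names `OperatorRegistry` and `REGISTER_OPERATOR` are hypothetical:

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

class Operator;  // the operator base class

// Hypothetical registry: maps an operator's name to a factory that creates
// instances and to the serialized OperatorProto that describes it to Python.
class OperatorRegistry {
 public:
  static OperatorRegistry& Instance() {
    static OperatorRegistry registry;
    return registry;
  }

  void Register(const std::string& name,
                std::function<Operator*()> factory,
                const std::string& serialized_operator_proto) {
    factories_[name] = std::move(factory);
    protos_[name] = serialized_operator_proto;
  }

 private:
  std::map<std::string, std::function<Operator*()>> factories_;
  std::map<std::string, std::string> protos_;
};

// A REGISTER_OPERATOR(gemm, GemmOp) style macro could call Register() at
// static-initialization time; paddle.framework.create_operation_creators
// would then read the stored protos to generate the Python creators.
```
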
## Operators, Layers, Variables, Scope

### Operators as Functions

An operator is intrinsically a function with inputs and outputs. For example, the functional representation of the GEMM operator is

```cpp
// Pseudocode: the functional form of the GEMM operator with an optional activation.
gemm(X, W, scale, act=ReLU) {
  unactivated = scale * X * W
  if !act {
    return unactivated
  }
  return unactivated, act(unactivated)
}
```

Note that operators might call other operators. In the above example, `gemm` calls `act`.

> **Review comment:** Maybe a better example of an Op calling other Ops is …

### Gradient Operators

Each operator has a corresponding gradient operator that defines the gradient computation.

> **Review comment:** One forward op corresponds not to a single gradient operator but to several gradient operators. More generally, Caffe2 and MXNet return an array of ops as one op's gradient ops.

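Following that review comment, a gradient "maker" could return a list of operators rather than a single one. A rough declaration-level sketch; the function name and signature are assumptions:

```cpp
#include <memory>
#include <vector>

class Operator;  // the operator base class

// Hypothetical: given a forward operator, build the operator(s) that compute
// its gradient.  Returning a vector accommodates ops whose backward pass is
// expressed as several operators, as Caffe2 and MXNet do.
std::vector<std::unique_ptr<Operator>> CreateGradientOperators(
    const Operator& forward_op);
```
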
### Layers

If we described a neural network by calling operation creators directly, our code would be lengthy. It is easier to use layer creators, which, in addition to calling operation creators for the computation, create model parameters.

For example, `paddle.layer.fc` and `paddle.layer.conv` both call `paddle.op.gemm`. They should also create and initialize `W`, the layer's parameter, as well as `act` and `scale`, the attributes.

### Variables

We prefer to represent inputs, outputs, and attributes as Variables, so that an operation's output can be used both to feed and to configure other operations.

Variables should be able to hold the following types of values:

- int
- bool
- float
- half
- string
- tensor
- operator
- scope

Some Python API proposals:

```python
x = paddle.variable.new("x")      # create a variable of a not-yet-known type

x = paddle.variable.tensor("x")   # create a tensor-typed variable without a value
x = paddle.variable.int("x", 10)  # create an int variable and set it to 10

x = paddle.variable.operation("x", paddle.operator.gemm(...))  # an operation

x = paddle.variable.tensor("x",
                           numpy.random.randn(200, 100),  # set the value of the tensor
                           estimated=True)                # will be updated by the backward algorithm

x.estimated = False  # prevents updates
```

> **Review comment:** These APIs could be unified into one:
> ```python
> x_var = paddle.variable.new("x", 10)
> y = paddle.variable.Tensor()
> y_var = paddle.variable.new("y", y)
> z = paddle.variable.new("z", "A string")
> no_name = paddle.variable.new(name=None, val=100)
> ```
> Actually, I think each variable should have a name. There might be some independent scope to hold them.

> **Review comment:** Is that variable …

> **Review comment:** I think maybe an independent Scope could represent all attribute variables of an operator, like I proposed here.

> **Review comment:** I don't see why an attribute is so special that it cannot be represented by a Variable.

> **Review comment:** Also, why can a variable be an operation? Is that needed for RnnOp?

Note that the variable has the following methods:

```cpp
class Variable {
 public:
  bool Estimated() const;
  bool SetEstimated(bool);

  template <typename T> const T& Get() const;
  template <typename T> T* GetMutable();
  template <typename T> bool IsType() const;
};
```

> **Review comment:** What is the meaning of the `bool` returned by `SetEstimated`?
>
> **Reply:** I think we can make it just `void`.

`Get` and `GetMutable` implement *lazy memory allocation*, as described in the [Variable design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.md).

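A hypothetical usage sketch of this interface; the `Tensor` stub and the function are illustrations. With lazy allocation, the first `GetMutable<T>()` call constructs the held value:

```cpp
#include <cassert>

class Tensor { /* stand-in for the real tensor type */ };

void VariableUsageExample(Variable* var) {
  Tensor* t = var->GetMutable<Tensor>();         // first access: allocates the Tensor lazily
  // ... fill `t` with data ...
  assert(var->IsType<Tensor>());                 // the variable now holds a Tensor
  const Tensor& read_only = var->Get<Tensor>();  // subsequent reads reuse the same object
  (void)read_only;
  (void)t;
}
```
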
Note that `name` is not a property of a variable. A variable can have various names in different scopes.

### Scope

In programming languages, variables belong to scopes. Scopes enable the release of local variables when a stack frame pops.

A neural network is arguably equivalent to a program. The *recurrent operator*, `paddle::operator::Recurrent`, is like a `for` loop, and the conditional operator, `paddle::operator::Conditional`, is like an `if`/`switch`. They can have sub-networks as attributes, and `paddle::operator::Recurrent/Conditional::Run` creates a local scope before executing the sub-network. At inference time, `paddle::operator::Recurrent/Conditional::Run` frees the scope before it completes. At training time, it is the corresponding gradient operators' `paddle::operator::RecurrentGrad/ConditionalGrad::Run` that frees the local scope.

The global and nested local scopes form a hierarchy. The following Python functions make it convenient to program with scopes:

1. `paddle.scope.current()` returns the current scope, which defaults to the top-level scope returned by `paddle.scope.global()`.
1. `paddle.scope.global()` returns the top-level scope.

> **Review comment:** Is this line …
>
> **Reply:** I meant that it defaults to the value returned by the next (second) item. I will rewrite it to make it clear.

C++ code shouldn't maintain global state, like the current scope, to prevent unexpected inconsistency.

```cpp
class Scope {
 public:
  Scope() : parent_(nullptr) {}  // The constructor creates only global scopes.

  Variable* FindVar(std::string name);               // Finds in the hierarchy or returns null.
  Variable* CreateVar(StringPiece name, Variable*);  // Finds or creates.

  Scope* CreateScope(StringPiece name);  // Finds or creates a sub-scope.
  void DeleteScope(Scope*);              // Deletes a sub-scope or raises an exception.

 private:
  std::map<std::string /*name*/, std::unique_ptr<Variable>> vars_;
  std::shared_ptr<Scope> parent_;
  std::vector<Scope*> children_;

  Mutex mutex_;  // Makes this class thread-safe.
};
```

> **Review comment:** Why do we need `children_` in a scope?
>
> **Reply:** Debug printing is the only usage in my mind.

> **Review comment:** It seems reasonable; maybe I should update the Scope design and add a mutex today.

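As an illustration of the hierarchical lookup implied by the comment on `FindVar`, a possible implementation following the members declared above; locking with `mutex_` is omitted for brevity, and this is a sketch rather than the actual implementation:

```cpp
Variable* Scope::FindVar(std::string name) {
  auto it = vars_.find(name);
  if (it != vars_.end()) {
    return it->second.get();  // found in the local scope
  }
  // Not found locally: delegate to the parent scope, or give up at the top.
  return parent_ ? parent_->FindVar(name) : nullptr;
}
```
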
## Execution and Context

A neural network is a program. Training or inference means executing it. The runtime environment of an execution is known as a *context*, which includes:

1. a scope,
1. device(s), or places, and
1. a flag indicating whether we are training, i.e., whether we should create gradient operations and run the backward pass,

and can be defined as

```cpp
struct Context {
  Scope* scope_;
  std::vector<Place> places_;  // a network might run on multiple devices.
  bool training_;
};
```

> **Review comment:** Maybe one device for the GPU is not enough. It could be a …

> **Review comment:** I think that the context of an Operator's `Run` method is different from that of a Net's `Run` method.

> **Review comment:** To be frank, I am not even sure that a net should be able to run on multiple devices. It seems that we can use multiple devices by doing data parallelism -- each device runs only one copy of the net, and we can do gradient aggregation using NCCL. In this way, it seems that both nets and operators need just one device.

> **Review comment:** Is there a higher-level concept to run …
> ```cpp
> class MultiDeviceNetwork {
>  private:
>   // holds networks on each device.
>   vector<Network> networks_;
> };
> ```
> However, I suggest that we only concern ourselves with a single device in the basic …

> **Review comment:** Up to what I understand now, I'd avoid running a network on multiple devices, because the maintenance of multiple CUDA streams means a very complicated CUDAContext or OpKernelContext or something like that. And it seems that we can make use of multiple devices using data parallelism, which requires only running a net on a single device.

The Python API `paddle.train` can prepare a context before it calls the C++ code `Operator::Run(const Context&)`.

As an example, `paddle::operator::Recurrent::Run(const Context& ctx)` can then create a new scope by calling `ctx.scope_->CreateScope`, and run the step-net with a new context around the new scope:

```cpp
class Recurrent {
 public:
  void Run(const Context& ctx) {
    auto* subscope = ctx.scope_->CreateScope("");
    step_net_.Run(Context{subscope, ctx.places_, ctx.training_});
    if (!ctx.training_) {
      ctx.scope_->DeleteScope(subscope);
    }
  }
};
```

Another example is that the Gemm operator needs to create a tensor on `Context::places_[0]` and assign the tensor to its output variable:

```cpp
class Gemm {
 public:
  void Run(const Context& ctx) {
    if (paddle::platform::IsGPUPlace(ctx.places_[0])) {
      cuDNNGemm(
          Output(0).mutable_data<float>(ctx.places_[0], DerivedSizeFromInputs()),
          ...);
    } else {
      mkl::sgemm(
          Output(0).mutable_data<float>(ctx.places_[0], DerivedSizeFromInputs()),
          ...);
    }
  }
};
```

> **Review comment:** Here it should be `cublasGemm`.
>
> **Reply:** Good point. I am reading and reviewing your design.

> **Review comment:** What about data parallelism on multiple GPUs? I think that `places_[0]` should not be used inside an operator's `Run` method.

> **Review comment:** Anyway, we must specify one place to aggregate data from multiple GPUs.

> **Review comment:** @reyoung suggested defining the aggregation as an operator. I am not sure. It seems much easier if a net runs on a single device, as in #2696 (comment).

> **Review comment:** I agree with @QiJune on #2696 (comment) -- using data parallelism and only …

### Place

A place indicates a device and its type. We have the following place definitions:

```cpp
struct GPUPlace {
  int device_;  // GPU id.
};

struct CPUPlace {
  enum Type {
    X86,
    ARM5,
    ARM6,
    ...
  };
  Type type_;
};
```

We can add more Place implementations, like FPGAPlace and XeonPhiPlace, in the future.

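The `Context` above stores a `std::vector<Place>`, and the Gemm example checks `paddle::platform::IsGPUPlace`. One way to tie the two structs above into a single `Place` type is a tagged union such as `boost::variant`; this wiring is an assumption for illustration, not part of the proposal:

```cpp
#include <boost/variant.hpp>

namespace paddle {
namespace platform {

struct GPUPlace { int device_; };     // as defined above
struct CPUPlace { /* Type type_; */ };

// Hypothetical: a Place is either a CPUPlace or a GPUPlace.
using Place = boost::variant<CPUPlace, GPUPlace>;

inline bool IsGPUPlace(const Place& place) {
  // boost::get on a pointer returns nullptr unless the variant holds a GPUPlace.
  return boost::get<GPUPlace>(&place) != nullptr;
}

}  // namespace platform
}  // namespace paddle
```
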
### Gradient Operators

A gradient operator should be built and linked only if we are building a binary that supports training. If we are building an "inference-only" binary, we shouldn't link gradient operators.

> **Review comment:** A gradient operator could also be a forward operator. For example, operator …

Gradient operations should be created only if we are going to train a neural network.

> **Review comment:** Maybe `InferShape` is also needed?

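A rough sketch of how the build could exclude gradient operators from inference-only binaries; the guard macro `PADDLE_WITH_TRAINING` and the class name are illustrative assumptions, not part of this design:

```cpp
// Compile (and register) the gradient operator only in training builds, so an
// inference-only binary never links it.
#ifdef PADDLE_WITH_TRAINING
class GemmGrad : public Operator {
 public:
  void Run(const Context& ctx) override {
    // compute the gradients of Gemm's inputs from the output gradient ...
  }
};
#endif  // PADDLE_WITH_TRAINING
```
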