A very preliminary draft #2696
Conversation
class Variable {
 public:
  bool Estimated() const;
  bool SetEstimated(bool);
What is the meaning of this returned bool?
I think we can make it just void.
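For concreteness, a minimal sketch of the interface with `SetEstimated` returning `void` (the member and class bodies here are illustrative, not the actual Paddle header):

```cpp
#include <iostream>

// Illustrative sketch: SetEstimated has nothing meaningful to return.
class Variable {
 public:
  bool Estimated() const { return estimated_; }
  void SetEstimated(bool estimated) { estimated_ = estimated; }

 private:
  bool estimated_ = false;
};

int main() {
  Variable v;
  v.SetEstimated(true);
  std::cout << std::boolalpha << v.Estimated() << "\n";  // prints "true"
  return 0;
}
```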
The global and nested local scopes form a hierarchy. The following Python functions make it convenient to program scopes:

1. `paddle.scope.current()` returns the current scope, which defaults to
Is this line incomplete? It ends with "which defaults to".
I meant that it defaults to the value returned by the next (second) line. I will rewrite it to make it clear.
```
};
```

Another example is that the Gemm operator needs to create a tensor on `Context::places_[0]` and assign the tensor to its output variable:
What about data parallelism on multiple GPUs? I think that `places_[0]` should not be used inside an operator's Run method.
Anyway, we must specify one place to aggregate data from multiple GPUs.
@reyoung suggested defining the aggregation as an operator. I am not sure. It seems much easier if a net runs on a single device as in #2696 (comment).
I agree with @QiJune on #2696 (comment) -- using data parallelism and only `places_[0]` would be enough.
if (paddle::platform::IsGPUPlace(ctx.places_[0])) {
  cuDNNGemm(
      Output(0).mutable_data<float>(ctx.places_[0], DerivedSizeFromInputs()),
      ...);
This should be cublasGemm, and cublasGemm needs to acquire a cublasHandle to finish the computation. Also, nearly all computation in a Run method needs to acquire an Eigen device. Please refer to #2648.
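As a rough illustration of what acquiring the handle could look like (the `DeviceContext` struct below is a hypothetical stand-in, not the design from #2648, and error checking is omitted):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical per-device context that owns the CUDA stream and cuBLAS handle.
struct DeviceContext {
  cudaStream_t stream;
  cublasHandle_t cublas_handle;
};

// Sketch of a GPU GEMM kernel body: C = A * B, with A (M x K), B (K x N),
// and C (M x N) stored in column-major order as cuBLAS expects.
void GpuGemm(const DeviceContext& ctx, const float* A, const float* B, float* C,
             int M, int N, int K) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSetStream(ctx.cublas_handle, ctx.stream);  // run on the context's stream
  cublasSgemm(ctx.cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
              M, N, K, &alpha, A, M, B, K, &beta, C, M);
}
```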
Good point. I am reading and reviewing your design.
```
  std::vector<Place> places_;  // a network might run on multiple devices.
  bool training_;
};
```
I think that the context of an Operator's Run method is different from that of a Net's Run method. The Net can run on multiple devices, but an operator can only run on a specific device. So the operator may need an OpContext to Run.
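One possible way to express that split, as a sketch with made-up names:

```cpp
#include <string>
#include <vector>

// Hypothetical placeholder for a device identifier.
struct Place { std::string name; };

// Net-level context: the whole net may see several devices.
struct NetContext {
  std::vector<Place> places;
};

// Operator-level context: each Run() call is bound to exactly one device.
struct OpContext {
  Place place;
};

// When the net dispatches an operator, it pins the operator to one place.
OpContext MakeOpContext(const NetContext& net_ctx, size_t device_index) {
  return OpContext{net_ctx.places.at(device_index)};
}
```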
To be frank I am not even sure that a net should be able to run on multiple devices. It seems that we can use multiple devices by doing data parallelism -- each device runs only one copy of the net, and we can do gradient aggregation using NCCL. In this way, it seems that both net and operators need just one device.
> it seems that both net and operators need just one device.

Is there a higher-level concept that runs NCCL in C++, or do we just let Python run NCCL? If that concept is in C++, it seems it is also a Network:

class MultiDeviceNetwork {
 private:
  // holds networks on each device.
  vector<Network> networks_;
};

However, I suggest that we only concern ourselves with a single device in the basic Network. It is easy to change a single-device Network into a multi-device Network by using NCCL.
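If that aggregation does live in C++, a minimal sketch of the NCCL all-reduce step could look like the following (the surrounding setup -- `ncclCommInitAll`, stream creation, per-device nets -- is assumed to exist elsewhere):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Sketch: sum one gradient buffer across devices after each backward pass.
// grads[i] points to device i's gradient buffer of `count` floats; comms and
// streams are one per device, created elsewhere.
void AllReduceGradients(const std::vector<float*>& grads, size_t count,
                        const std::vector<ncclComm_t>& comms,
                        const std::vector<cudaStream_t>& streams) {
  ncclGroupStart();
  for (size_t i = 0; i < grads.size(); ++i) {
    ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();
}
```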
From what I understand so far, I'd avoid running a network on multiple devices, because maintaining multiple CUDA streams means a very complicated CUDAContext or OpKernelContext or something like that.
And it seems that we can make use of multiple devices via data parallelism, which requires only running a net on a single device.
### Gradient Operators

Each operator has a corresponding gradient operator that defines the gradient computation.
One forward op corresponds not only to one gradient operator, but possibly to several gradient operators. For example, FcOp has three inputs, and not all of them need gradients in practice. We could implement one gradient operator per input, so FcOp could have three gradient ops. More generally, Caffe2 and MXNet return an array of ops as one op's gradient ops.
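A sketch of that shape of API, with made-up types and names just to illustrate returning a list of gradient ops:

```cpp
#include <string>
#include <vector>

// Made-up minimal op description, for illustration only.
struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// The gradient of an FC-like op with inputs {X, W, b} is a *list* of ops,
// one per input that actually needs a gradient.
std::vector<OpDesc> MakeFcGradOps(bool x_needs_grad, bool w_needs_grad,
                                  bool b_needs_grad) {
  std::vector<OpDesc> grad_ops;
  if (x_needs_grad) grad_ops.push_back({"fc_grad_x", {"Out_grad", "W"}, {"X_grad"}});
  if (w_needs_grad) grad_ops.push_back({"fc_grad_w", {"Out_grad", "X"}, {"W_grad"}});
  if (b_needs_grad) grad_ops.push_back({"fc_grad_b", {"Out_grad"}, {"b_grad"}});
  return grad_ops;
}
```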
x = paddle.variable.operation("x", paddle.operator.gemm(...)) # an operation
x = paddle.variable.tensor("x",
                           numpy.random.randn(200, 100), # set the value of the tensor
                           estimated = true) # will be updated by the backward algorithm.
`estimated` is not a very straightforward name. Maybe `need_backward` or `requires_grad` is better? Not all variables that need gradients will be updated; only parameters are.
### Gradient Operators

A gradient operator should be built and linked only if we are building a binary that supports training. If we are building an "inference-only" binary, we shouldn't link gradient operators.
A gradient operator could also be a forward operator. For example, the gradient operator of operator `mul` is also `mul`.
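For example, if `mul` is matrix multiplication `Z = X·Y`, the gradients are `dL/dX = dL/dZ · Yᵀ` and `dL/dY = Xᵀ · dL/dZ`, which are again matrix multiplications, so the same kernel can serve in both the forward and backward passes.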
```python
x = paddle.variable.new("x")      # create a variable of not yet known type
x = paddle.variable.tensor("x")   # create a tensor typed variable without value
x = paddle.variable.int("x", 10)  # create an unnamed int variable and set to 10
```
Is that variable really unnamed, or is it a typo? Maybe `x = paddle.variable.int(10)`?
I cannot find why it is necessary for a variable to be unnamed.
I think maybe an independent Scope could represent all attribute variables of an operator? As I proposed here.
I don't see why an attribute is so special that it cannot be represented by a Variable.
### Operators as Functions

An operator is intrinsically a function, which has inputs and outputs. For example, the functional representation of the GEMM operator is
Maybe `InferShape` is also needed?
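A sketch of an operator interface that carries both methods (names are illustrative, not the final design):

```cpp
#include <vector>

// Illustrative operator interface with both shape inference and execution.
class OperatorBase {
 public:
  virtual ~OperatorBase() = default;

  // Derive output shapes from input shapes, before any memory is allocated.
  virtual std::vector<std::vector<int>> InferShape(
      const std::vector<std::vector<int>>& input_shapes) const = 0;

  // Perform the actual computation.
  virtual void Run() const = 0;
};
```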
```
}
```

Note that operators might call other operators. In the above example, `gemm` calls `act`.
Maybe a better example of an Op calling other Ops is RnnOp. :-) Because a normal gemm doesn't contain an activation.
Some Python API proposals:

```python
x = paddle.variable.new("x")  # create a variable of not yet known type
```
Could these APIs be unified into one `paddle.variable.new`?

x_var = paddle.variable.new("x", 10)
y = paddle.variable.Tensor()
y_var = paddle.variable.new("y", y)
z = paddle.variable.new("z", "A string")
no_name = paddle.variable.new(name=None, val=100)  # actually I think each variable should have a name. There might be some independent scope to hold them.
x = paddle.variable.new("x")      # create a variable of not yet known type
x = paddle.variable.tensor("x")   # create a tensor typed variable without value
x = paddle.variable.int("x", 10)  # create an unnamed int variable and set to 10
x = paddle.variable.operation("x", paddle.operator.gemm(...))  # an operation
Also, why can a variable be an operation? Is that needed for RnnOp?
```
};
```

`Get` and `GetMutable` implement *lazy memory allocation*, as described in the [Variable design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.md).
Maybe only `GetMutable` implements lazy memory allocation. Should `Get` raise an error if that variable has not been created?
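A simplified, single-type sketch of that behavior (the real Variable is type-erased, so this only illustrates the Get/GetMutable asymmetry):

```cpp
#include <memory>
#include <stdexcept>

// Simplified sketch: GetMutable() lazily allocates; Get() fails loudly if the
// value has not been created yet.
template <typename T>
class Variable {
 public:
  const T& Get() const {
    if (!value_) throw std::runtime_error("Variable read before it was created");
    return *value_;
  }

  T* GetMutable() {
    if (!value_) value_ = std::make_unique<T>();  // lazy allocation happens here
    return value_.get();
  }

 private:
  std::unique_ptr<T> value_;
};
```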
  std::shared_ptr<Scope> parent_;
  std::vector<Scope*> children_;

  Mutex mutex_;  // Make this class thread-safe.
It seems reasonable; maybe I should update the Scope design and add a mutex today.
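For example, a lookup-or-create on the variable map could be guarded roughly like this (a sketch, not the actual Scope code):

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

class Variable {};  // placeholder for the real Variable

class Scope {
 public:
  // Thread-safe lookup-or-create on the scope's variable map.
  Variable* CreateOrGetVariable(const std::string& name) {
    std::lock_guard<std::mutex> guard(mutex_);  // serialize access to vars_
    auto& slot = vars_[name];
    if (!slot) slot = std::make_unique<Variable>();
    return slot.get();
  }

 private:
  std::map<std::string, std::unique_ptr<Variable>> vars_;
  std::mutex mutex_;
};
```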
A neural network is a program. Training or inference is to execute it. The runtime environment of execution is known as a *context*:

1. a scope,
1. device(s), or places,
Maybe a device alone is not enough for GPU. There could be a DeviceContext for each GPU which holds:
- a computation stream
- handles for cuDNN/cuBLAS, etc.
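A rough sketch of such a per-GPU context (the struct name is illustrative and error checking is omitted):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cudnn.h>

// Rough sketch of a per-GPU context owning a stream plus library handles.
struct CUDADeviceContext {
  explicit CUDADeviceContext(int device_id) {
    cudaSetDevice(device_id);
    cudaStreamCreate(&stream);
    cublasCreate(&cublas_handle);
    cublasSetStream(cublas_handle, stream);
    cudnnCreate(&cudnn_handle);
    cudnnSetStream(cudnn_handle, stream);
  }

  ~CUDADeviceContext() {
    cudnnDestroy(cudnn_handle);
    cublasDestroy(cublas_handle);
    cudaStreamDestroy(stream);
  }

  cudaStream_t stream;
  cublasHandle_t cublas_handle;
  cudnnHandle_t cudnn_handle;
};
```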
 private:
  std::map<std::string /*name*/, std::unique_ptr<Variable> > vars_;
  std::shared_ptr<Scope> parent_;
  std::vector<Scope*> children_;
Why do we need the `children_` of a scope?
Debug printing is the only usage I have in mind.
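If debug printing is the use case, a recursive dump over `children_` is about all that is needed (a sketch with simplified members):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Simplified sketch of the one use mentioned for children_: debug printing.
struct Scope {
  std::vector<std::string> var_names;   // stand-in for the keys of vars_
  std::vector<Scope*> children;

  void DebugPrint(int depth = 0) const {
    const std::string indent(depth * 2, ' ');
    for (const auto& name : var_names) std::cout << indent << name << "\n";
    for (const Scope* child : children) child->DebugPrint(depth + 1);
  }
};
```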