
A very preliminary draft #2696


Closed
wants to merge 2 commits

Conversation

wangkuiyi (Collaborator):

No description provided.

class Variable {
 public:
  bool Estimated() const;
  bool SetEstimated(bool);
Member:

What is the meaning of this returned bool?

Collaborator Author:

I think we can make it just void.
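
A minimal sketch of the revised declaration (only the two members under discussion, with the setter returning void):

```cpp
class Variable {
 public:
  // True if this variable is a parameter to be estimated (updated) by the
  // backward algorithm.
  bool Estimated() const;

  // Marks whether the variable is to be estimated; returns nothing, since
  // the previous bool return value carried no information.
  void SetEstimated(bool estimated);
};
```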


The global and nested local scopes form a hierarchy. The following Python functions make it convenient to program scopes:

1. `paddle.scope.current()` returns the current scope, which defaults to
Member:

Is this line incomplete? It ends with "which defaults to".

Collaborator Author:

I meant that it defaults to the value returned by the next (second) line. I will rewrite it to make it clear.

};
```

Another example is that the Gemm operator needs to create a tensor on `Context::places_[0]` and assign the tensor to its output variable:
Member:

What about data parallelism on multiple GPUs? I think that places_[0] should not be used inside an operator's Run method.

Contributor:

Anyway, we must specify one place to aggregate data from multiple GPUs.

Collaborator Author:

@reyoung suggested defining the aggregation as an operator. I am not sure. It seems much easier if a net runs on a single device as in #2696 (comment).

Collaborator Author:

I agree with @QiJune on #2696 (comment) -- using data parallelism and only places_[0] would be enough.

if (paddle::platform::IsGPUPlace(ctx.places_[0])) {
  cuDNNGemm(
      Output(0).mutable_data<float>(ctx.places_[0], DerivedSizeFromInputs()),
      ...);
@QiJune (Member), Jul 5, 2017:

This should be cublasGemm (GEMM is a cuBLAS routine, not a cuDNN one), and cublasGemm needs to acquire a cublasHandle to finish the computation.
Nearly all of the computation in the Run method needs to acquire an Eigen device.
Please refer to #2648
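
For illustration, a hedged sketch of a GPU GEMM kernel that receives the cuBLAS handle and its stream from the runtime context instead of creating them itself; the function name and parameter list are assumptions made for this discussion, not existing Paddle code:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Column-major GEMM: C (m x n) = A (m x k) * B (k x n).
// The handle and stream are owned by a per-device context, so the kernel
// itself never calls cublasCreate.
void GemmGPU(cublasHandle_t handle, cudaStream_t stream,
             const float* A, const float* B, float* C,
             int m, int n, int k) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSetStream(handle, stream);
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k, &alpha, A, m, B, k, &beta, C, m);
}
```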

Collaborator Author:

Good point. I am reading and reviewing your design.

  std::vector<Place> places_;  // a network might run on multiple devices.
  bool training_;
};
```
@QiJune (Member), Jul 5, 2017:

I think that the context of an operator's Run method is different from that of a Net's Run method.
The Net can run on multiple devices, but an operator can only run on one specific device.
So the operator may need an OpContext to Run.
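
A rough sketch of that split; both struct names are placeholders for the discussion, not a settled design:

```cpp
#include <vector>

struct Place {};  // placeholder for paddle::platform::Place

// Context for a whole Net: it may span several devices and carries
// net-level state such as the training/inference flag.
struct NetContext {
  std::vector<Place> places_;
  bool training_;
};

// Context for a single operator's Run: exactly one place, plus whatever
// per-device resources (stream, cuDNN/cuBLAS handles) the kernel needs.
struct OpContext {
  Place place_;
  // e.g. handles to the device's stream and math libraries; their exact
  // types are what this thread is still discussing.
};
```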

@wangkuiyi (Collaborator, Author), Jul 5, 2017:

To be frank, I am not even sure that a net should be able to run on multiple devices. It seems that we can use multiple devices by doing data parallelism -- each device runs one copy of the net, and we can do gradient aggregation using NCCL. In this way, it seems that both the net and the operators need just one device.

Collaborator:

> it seems that both net and operators need just one device.

Is there a higher-level concept to run NCCL in C++, or do we just let Python run NCCL? If that concept lives in C++, it seems that it is also a Network.

class MultiDeviceNetwork {
 private:
  // holds the networks on each device.
  std::vector<Network> networks_;
};

However, I suggest that the basic Network only concern itself with a single device. It is easy to turn a single-device Network into a multi-device Network by using NCCL.

Collaborator Author:

From what I understand so far, I'd avoid running a network on multiple devices, because maintaining multiple CUDA streams would mean a very complicated CUDAContext, OpKernelContext, or something like that.

And it seems that we can make use of multiple devices through data parallelism, which requires running a net only on a single device.
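
To make the data-parallel alternative concrete, here is a hedged sketch; Net, Scope, GradientBuffers, and GradientCount are placeholders invented for this illustration, and only the NCCL/CUDA calls are real API:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

struct Net {};    // one replica of the network, bound to a single device
struct Scope {};  // per-device scope holding that replica's variables

// Hypothetical helpers (stubbed out here): the gradient buffers living in
// one replica's scope, and the element count of the i-th gradient.
std::vector<float*> GradientBuffers(const Scope&) { return {}; }
size_t GradientCount(const Scope&, size_t) { return 0; }

struct DataParallelNet {
  std::vector<Net> replicas_;          // one copy of the net per GPU
  std::vector<Scope> scopes_;          // one scope per GPU
  std::vector<ncclComm_t> comms_;      // one NCCL communicator per GPU
  std::vector<cudaStream_t> streams_;  // one stream per GPU

  // Sums each gradient across devices; afterwards every replica holds the
  // same aggregated gradient and can apply the same parameter update.
  void AggregateGradients() {
    size_t num_grads = GradientBuffers(scopes_[0]).size();
    for (size_t i = 0; i < num_grads; ++i) {
      ncclGroupStart();  // required when one thread drives several comms
      for (size_t dev = 0; dev < replicas_.size(); ++dev) {
        float* buf = GradientBuffers(scopes_[dev])[i];
        ncclAllReduce(buf, buf, GradientCount(scopes_[dev], i),
                      ncclFloat, ncclSum, comms_[dev], streams_[dev]);
      }
      ncclGroupEnd();
    }
  }
};
```

Under this scheme, each per-device net and its operators only ever see a single place.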


### Gradient Operators

Each operator has a corresponding gradient operator that defines the gradient computation.
Collaborator:

One forward op corresponds not to a single gradient operator but possibly to several gradient operators.

For example, FcOp has three inputs, and not all of them need gradients in practice. We could let each input correspond to one gradient operator, so FcOp could have up to three gradient ops.

More generally, Caffe2 and MXNet return an array of ops as one op's gradient ops.
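
A hedged sketch of the "one forward op, several gradient ops" idea at the registration level; OpDesc, GradOpBuilder, the @GRAD suffix, and the assumed input order (X, W, b) are illustrative placeholders, not existing Paddle types:

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// A forward op type maps to a builder that may return *several* gradient
// ops, mirroring what Caffe2 and MXNet do.
using GradOpBuilder = std::function<std::vector<OpDesc>(const OpDesc&)>;
std::map<std::string, GradOpBuilder> g_grad_builders;

// FcOp: out = act(X * W + b), with inputs assumed to be {X, W, b}.  Each
// input that needs a gradient gets its own gradient op, so the builder
// returns up to three ops.
std::vector<OpDesc> FcGrad(const OpDesc& fwd) {
  return {
      {"fc_grad_x", {fwd.inputs[1], "out@GRAD"}, {fwd.inputs[0] + "@GRAD"}},
      {"fc_grad_w", {fwd.inputs[0], "out@GRAD"}, {fwd.inputs[1] + "@GRAD"}},
      {"fc_grad_b", {"out@GRAD"}, {fwd.inputs[2] + "@GRAD"}},
  };
}
```

The backward pass could then skip any of the three ops whose output gradient is not actually needed.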

x = paddle.variable.operation("x", paddle.operator.gemm(...)) # an operation
x = paddle.variable.tensor("x",
                           numpy.random.randn(200, 100), # set the value of the tensor
                           estimated=True) # will be updated by the backward algorithm.
@reyoung (Collaborator), Jul 5, 2017:

estimated is not a very straightforward name. Maybe need_backward or requires_grad would be better? Not all variables that need gradients will be updated; only parameters are.


### Gradient Operators

A gradient operator should be built and linked only if we are building a binary that supports training. If we are building an "inference-only" binary, we shouldn't link gradient operators.
Collaborator:

A gradient operator could also be a forward operator.

For example, operator mul's gradient operator is also mul.
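
For example, for elementwise z = x * y we have dx = dz * y and dy = dz * x, so the backward pass of mul can be written with two forward mul calls (a CPU-only sketch with made-up function names):

```cpp
// Forward elementwise mul: out[i] = a[i] * b[i].
void Mul(const float* a, const float* b, float* out, int n) {
  for (int i = 0; i < n; ++i) out[i] = a[i] * b[i];
}

// The gradient of z = x * y is dx = dz * y and dy = dz * x, so the
// "gradient operator" of mul is just mul applied to different inputs.
void MulGrad(const float* x, const float* y, const float* dz,
             float* dx, float* dy, int n) {
  Mul(dz, y, dx, n);
  Mul(dz, x, dy, n);
}
```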

```python
x = paddle.variable.new("x") # create a variable of not yet known type
x = paddle.variable.tensor("x") # create a tensor-typed variable without a value
x = paddle.variable.int("x", 10) # create an unnamed int variable and set it to 10
@reyoung (Collaborator), Jul 5, 2017:

Is that variable unnamed, or is this a typo? Maybe x = paddle.variable.int(10)?
I cannot see why it would be necessary for a variable to be unnamed.

@reyoung (Collaborator), Jul 5, 2017:

I think maybe an independent Scope could represent all the attribute variables of an operator?

As I proposed here.

Collaborator Author:

I don't see why an attribute is so special that it cannot be represented by a Variable.


### Operators as Functions

An operator is intrinsically a function, which has inputs and outputs. For example, the functional representation of the GEMM operator is
Collaborator:

Maybe InferShape is also needed?

}
```

Note that operators might call other operators. In the above example, `gemm` calls `act`.
Collaborator:

Maybe a better example of an op calling other ops is RnnOp. :-) A plain gemm doesn't contain an activation.

Some Python API proposals:

```python
x = paddle.variable.new("x") # create a variable of not yet known type
Collaborator:

These APIs could be unified into a single paddle.variable.new?

x_var = paddle.variable.new("x", 10)

y = paddle.variable.Tensor()
y_var = paddle.variable.new("y", y)

z = paddle.variable.new("z", "A string")

no_name = paddle.variable.new(name=None, val=100)  # actually I think each variable should have a name. It might have some independent scope to hold them.

x = paddle.variable.new("x") # create a variable of not yet known type
x = paddle.variable.tensor("x") # create a tensor-typed variable without a value
x = paddle.variable.int("x", 10) # create an unnamed int variable and set it to 10
x = paddle.variable.operation("x", paddle.operator.gemm(...)) # an operation
Collaborator:

Also, why can a variable be an operation? Is that needed for RnnOp?

};
```

`Get` and `GetMutable` implement *lazy memory allocation*, as described in the [Variable design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.md).
Collaborator:

Maybe only GetMutable should implement lazy memory allocation; Get should raise an error if the variable has not been created yet?
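
A hedged sketch of that split in behavior (simplified placeholder code, not the actual Variable implementation; type checking is omitted):

```cpp
#include <cassert>
#include <memory>

class Variable {
 public:
  // Get never allocates; it fails loudly if the value was never created.
  template <typename T>
  const T& Get() const {
    assert(holder_ != nullptr && "Variable::Get called before GetMutable");
    return *static_cast<const T*>(holder_.get());
  }

  // GetMutable allocates lazily on first access.
  template <typename T>
  T* GetMutable() {
    if (holder_ == nullptr) {
      holder_ = std::shared_ptr<void>(
          new T(), [](void* p) { delete static_cast<T*>(p); });
    }
    return static_cast<T*>(holder_.get());
  }

 private:
  std::shared_ptr<void> holder_;
};
```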

  std::shared_ptr<Scope> parent_;
  std::vector<Scope*> children_;

  Mutex mutex_;  // Make this class thread-safe.
Collaborator:

That seems reasonable; maybe I should update the Scope design and add a mutex today.
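
A minimal sketch of the thread-safe variant, assuming NewVar is the only mutating entry point (names are placeholders):

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

class Variable {};  // placeholder

class Scope {
 public:
  // Creates (or returns) the variable with the given name.  The lock makes
  // concurrent calls from multiple operator threads safe.
  Variable* NewVar(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& slot = vars_[name];
    if (slot == nullptr) slot.reset(new Variable());
    return slot.get();
  }

 private:
  std::map<std::string, std::unique_ptr<Variable>> vars_;
  std::mutex mutex_;  // Makes this class thread-safe.
};
```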

A neural network is a program. Training or inference means executing it. The runtime environment of the execution is known as a *context*:

1. a scope,
1. device(s), or places,
Collaborator:

Maybe a device alone is not enough for a GPU. There could be a DeviceContext for each GPU (see the sketch below), which holds:

- a computation stream
- handles for cuDNN, cuBLAS, etc.
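
A hedged sketch of such a per-GPU context; the struct name and members are assumptions for this discussion, and error checking is omitted:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cudnn.h>

// One of these per GPU: it owns the stream and the library handles so that
// kernels never create handles themselves.
struct CUDADeviceContext {
  explicit CUDADeviceContext(int device_id) {
    cudaSetDevice(device_id);
    cudaStreamCreate(&stream);
    cublasCreate(&cublas_handle);
    cublasSetStream(cublas_handle, stream);
    cudnnCreate(&cudnn_handle);
    cudnnSetStream(cudnn_handle, stream);
  }
  ~CUDADeviceContext() {
    cudnnDestroy(cudnn_handle);
    cublasDestroy(cublas_handle);
    cudaStreamDestroy(stream);
  }

  cudaStream_t stream;
  cublasHandle_t cublas_handle;
  cudnnHandle_t cudnn_handle;
};
```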

 private:
  std::map<std::string /*name*/, std::unique_ptr<Variable> > vars_;
  std::shared_ptr<Scope> parent_;
  std::vector<Scope*> children_;
Member:

Why do we need the children_ of a scope?

Collaborator Author:

Debug printing is the only use case I have in mind.
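
For example, a debug helper that walks children_ could look like this sketch (DebugString is a hypothetical free function, not an existing method):

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

class Variable {};  // placeholder

struct Scope {
  std::map<std::string, std::unique_ptr<Variable>> vars_;
  std::vector<Scope*> children_;
};

// Recursively prints the variable names in this scope and in all child
// scopes, indenting one level per nesting depth.
std::string DebugString(const Scope& scope, int indent = 0) {
  std::string out;
  for (const auto& kv : scope.vars_) {
    out += std::string(indent, ' ') + kv.first + "\n";
  }
  for (const Scope* child : scope.children_) {
    out += DebugString(*child, indent + 2);
  }
  return out;
}
```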

@luotao1 (Contributor) commented on Feb 1, 2019:

Thanks for contributing to PaddlePaddle! Since documents have been moved to the FluidDoc repo, we are closing this PR. You are welcome to contribute to the FluidDoc repo.

@luotao1 closed this on Feb 1, 2019