Recurrent Attention API: Support constants in RNN #7980
Conversation
…them on to the cell
Any idea why it seems to fail with CNTK (according to Travis)?
Can you explain why this would be necessary? I believe RNN cells that need specific constants should be managing their own constants, e.g. what we do for dropout (the parent RNN is agnostic to dropout constants). I am pretty sure Attention wrappers would be implementable with the current API.
The key here is that the constants (the "attended" when used for attention) really are inputs to the recurrent layer (I refer to them as "external" constants in the docs, but maybe we need a better name). I think it would be bad and confusing to break the principle that all inputs to a layer should be passed to that layer in `__call__`. The constant(s) needed for attention are fundamentally different from dropout, as the latter is only a regularisation method and we don't pass the dropout values in inference mode (they are only generated in training), but the attended, of course, has to be passed as input. When it comes to implementation/API, it is certainly possible to implement attention in Keras using only a custom wrapper - I've done it several times and that was the solution in my original API suggestion, which I think is worse ;). But let's look at some options... This is the "look and feel" to the user of my current suggestion:
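Something along these lines; a minimal sketch of the call site, where `AttentionCell` is a hypothetical placeholder (assumed to implement the `call(inputs, states, constants)` contract sketched in full at the bottom of this page) and `constants` is the keyword name this PR eventually settles on:

from keras.layers import Input, RNN

attended = Input(shape=(64,))          # the "attended", e.g. CNN image features
x = Input(shape=(None, 32))            # the regular input sequence
cell = AttentionCell(128)              # hypothetical attention cell
y = RNN(cell)(x, constants=attended)   # the layer passes `attended` on to the cell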
I guess what your suggestion means is that you would only let the cell know about the attended; something like one of the variants sketched below:
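(Both variants below are hypothetical sketches; `AttentionCell` is a placeholder assumed to store the attended tensor itself and use it in its step computation.)

# Variant 1: the attended is handed to the cell at construction time;
# the RNN layer never sees it.
cell = AttentionCell(units=128, attended=attended)
y = RNN(cell)(x)

# Variant 2: the cell exposes a setter that must be called before the RNN is applied.
cell = AttentionCell(units=128)
cell.set_attended(attended)
y = RNN(cell)(x)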
You can probably make this work but I'd like to avoid it for a number of reasons:
I think a good comparison is why we can (optionally) pass `initial_state` in `__call__`. Another option would be to treat the attended as part of the `inputs`. Yet another (better) option is to add a new class for this. It is also, of course, possible that I'm missing some other option. Please question the above!
The fact that we in the current RNN implementation "temporarily" modify the `input_spec` when `initial_state` is passed in `__call__` is related to this.
Thanks for the detailed explanation! It's a reasonable choice. An alternative would be the following:
y = AttentionRNN(cell)(x, attended_sequence=z)

What do you think? In terms of PR process, I suggest doing a single PR for the whole thing. That will make the purpose of the various pieces more apparent.
Sounds great! Basically, the additions to the `RNN` base class would support this. A few details: ...

So in summary: ...

Totally agree about the PRs. Let's close the second one to keep discussion in one place!
Cool, let's go with this then!
Great, I'll update this PR and the API doc this weekend.
…ention_api_unified
… outputs in wrapped model
keras/layers/recurrent.py
Outdated
@@ -194,6 +196,130 @@ def get_losses_for(self, inputs=None):
        return losses


class FunctionalRNNCell(Wrapper):
Is `RNNCellModel` a better name?
No, that's not a good idea. It's not a `Model` in the sense that it should not have `compile`, `fit`, `predict`, etc. It should just "package" the cell based on the functional definition.
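For context, a rough sketch of how such a functionally-defined cell might be used; the constructor arguments are assumptions for illustration, not the exact signature in this diff:

from keras.layers import Input, Dense, concatenate, RNN

# One step of the recurrence, defined with the functional API:
x_t = Input(shape=(8,))      # input at a single timestep
h_tm1 = Input(shape=(4,))    # previous state
h_t = Dense(4, activation='tanh')(concatenate([x_t, h_tm1]))

# The wrapper "packages" this graph as an RNN cell:
cell = FunctionalRNNCell(inputs=x_t, outputs=h_t,
                         input_states=h_tm1, output_states=h_t)
y = RNN(cell)(Input(shape=(None, 8)))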
We should probably override `__call__` as well.
This is a big PR. If any users of Attention or people familiar with Keras RNNs could help with reviewing, that would be greatly appreciated.
Yes, it would be great to get input from more contributors/users. Here is a guide to the additions: two new classes (one of them the `FunctionalRNNCell` shown above) plus changes in the `RNN` base class. There are also many more tests required, but I want some feedback/review before continuing.
Could we move attention support to the base class `RNN`? Also: what would be some alternative names for the "attended"?
This is exactly what I implemented in a backward-compatible way (and tested) in the original version of this PR (I then used the keyword argument `constants`).
As discussed above, "sequence" should not be in the name. Again, the attended can be anything (including multiple tensors). A standard application is image captioning, where the attended is an image (or typically CNN feature maps of an image). I do still think it would be better to support passing what I originally referred to in the docs as external constants in the base `RNN` class.

Regarding naming: the Theano "non_sequences" is not a very user-friendly name, as these "external constants" could be sequences that are just not iterated over by the RNN cell.

...I apologise for these "essay" comments. This is an intricate matter. I've really tried to do a thorough requirements analysis with respect to both how attention is used in research and the current state of Keras (which e.g. led to my conclusion that the RNN cell should really be broken out before this addition). Perhaps it was too early to move to PRs, but on the other hand it is the only way to really show the implications and feasibility of the suggestion. It is quite slow to work with the sparse, and occasionally inconsistent ;), feedback. I'd be happy to set up a Hangouts session to discuss details and options directly!
Summary of the very long comment above: ...
I've been looking at these PRs for some time and I'm still not sure. It would definitely be helpful to have more people chime in with their perspective. I think you're probably right: it would be good to add `constants` support to the base `RNN` layer. If we do this, the API changes would be: ...
Besides just the semantic problem of having multiple similar but independent mechanisms with the same name, one thing to consider is performance. Different backends might want to optimize the way they manage constants. Theano does it to some extent, for instance.
Let's see... how bad would it be to have something like:

class RNN(...):

    def __call__(self, inputs, initial_state=None, constants=None, **kwargs):
        # handle connectivity of `constants`, like we do for `initial_state`
        ...

    def call(self, inputs, mask=None, training=None, initial_state=None, constants=None):
        ...
        if constants:
            if not has_arg(self.cell.call, 'constants'):
                raise ValueError(...)

        def step(inputs, states):
            constants = inputs[1:]
            if has_arg(self.cell.call, 'training'):
                return self.cell.call(inputs[0], constants=constants, training=training)
            else:
                return self.cell.call(inputs[0], constants=constants)

        last_output, outputs, states = K.rnn(step,
                                             inputs,
                                             initial_state,
                                             go_backwards=self.go_backwards,
                                             mask=mask,
                                             constants=constants,
                                             unroll=self.unroll,
                                             input_length=timesteps)
Good, at this point it seems obvious that we should add support for `constants` in `RNN`.

I'm not sure exactly what you mean by this, but yes, it's important that...

I don't think there is anything strange about having a layer...

Yes, this is important and quite complex. I wrote about it in my previous comments and put a note on it in the PR (https://github.com/fchollet/keras/pull/7980/files#diff-3118e4e28157032506f771f279a551c3R2340). I agree that it's better to pass constants to the backend `K.rnn`.

This is not too bad - it's one way of making sure that the constants are passed via the backend `K.rnn`.
I think this mistake is symptomatic of one of the issues with the Keras recurrent code base - that arguments are not always what they "claim to be" and will be different things at different times: "states can be either states - or states and constants" and "inputs can be either inputs - or inputs and initial_state" (in `__call__`). I'd recommend working towards removing this pattern. We should try to make each argument mean one thing only. The other example, that...
Please trust me on this one; it will destroy readability of attention mechanisms and cause a lot of implementation overhead when trying to make them modular if we start mixing states and constants.
Yes, passing all input tensors as a flat list is part of the internal API. On the user-facing side of things, we can always avoid doing that (and instead pass the additional arguments as keyword arguments), but it is unavoidable at model deserialization time. I agree that it's difficult to figure out a reasonable way to parse the list of inputs. Since the problem only occurs at deserialization time, it should be possible to store in the layer config whether or not initial states and constants are "keras tensors" (i.e. whether they are part of the model topology), which would then provide all the information needed for deserialization. Keep in mind that this is an implementation detail. The public API is still to use keyword arguments.
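For illustration, the bookkeeping could look roughly like this; a plain-Python sketch whose helper name and exact layout are assumptions, though it mirrors the `_normalize_args`/`_num_constants` code discussed below:

def standardize_args(inputs, initial_state, constants, num_constants):
    """Split a flat `inputs` list back into (inputs, initial_state, constants).

    At deserialization time all call arguments may arrive concatenated as
    [inputs] + initial_state + constants; `num_constants` (stored in the
    layer config) tells us where the constants tail begins.
    """
    if isinstance(inputs, list):
        assert initial_state is None and constants is None
        if num_constants is not None:
            constants = inputs[-num_constants:]
            inputs = inputs[:-num_constants]
        if len(inputs) > 1:
            initial_state = inputs[1:]
        inputs = inputs[0]
    return inputs, initial_state, constants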
It is purely an implementation detail and will not modify the signatures of the `call` methods.
Ok great, (1.) is not a problem then. As you suggest, ... I'm not sure I understood the answer to (2.): is my understanding correct that we should not change the expected signature of the (backend `K.rnn`) step function?
I totally appreciate Keras' extreme focus on the public API. But for the library (and API) to stay relevant, it has to continue to develop. The structure and readability of its internals are crucial for easy extension and for people to continue to contribute. Many improvements could be made that are in no conflict with the public API. ...The quote above sounds a bit scary ;)
Yes, indeed, we can simply wrap the calls to the step function, as in the code snippet. Your other comment: my quote on readability means to say that since method signatures don't change, the implementation of...
…ention_api_constants_support
Ok, so I've updated this PR back to "support for constants in `RNN`". For an overview, there is a functioning implementation (based on the additions in this PR) of the next steps:
here: andhus#3. I didn't want to pull it into this PR until we've settled on the changes in `RNN`. There is also an explanation of the example in the linked PR (and a benchmark of MoGAttention vs. regular seq-to-seq) here: https://github.com/andhus/extkeras/tree/master/examples/attention
    for dim in self.cell.state_size]
else:
    self.state_spec = InputSpec(shape=(None, self.cell.state_size))
self.state_spec = None
For more advanced cells (typically attention cells such as this one: https://github.com/andhus/keras/pull/3/files#diff-9c4188ddc4dd173d80f64feed5b89412R258) the `cell.state_size` is not defined until the cell is built. But that's not a problem: `state_spec` is just set a bit later in `build` (or in `__call__` if `initial_state` is passed to `RNN`).
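For instance, a toy cell along these lines (a hedged sketch, not the linked implementation) only knows its `state_size` after `build` has run:

import keras.backend as K
from keras.layers import Layer

class LateStateSizeCell(Layer):
    """Toy cell whose state size depends on the input dimension."""

    def __init__(self, **kwargs):
        super(LateStateSizeCell, self).__init__(**kwargs)
        self.state_size = None  # unknown until the input shape is seen

    def build(self, input_shape):
        input_dim = input_shape[-1]
        self.state_size = input_dim  # now the RNN layer can set its state_spec
        self.kernel = self.add_weight(shape=(input_dim, input_dim),
                                      initializer='uniform', name='kernel')
        self.built = True

    def call(self, inputs, states):
        h = K.dot(inputs, self.kernel) + states[0]
        return h, [h]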
keras/layers/recurrent.py
Outdated
initial_state = inputs[1:]
inputs = inputs[0]
def __call__(self, inputs, initial_state=None, constants=None, **kwargs):
    inputs, initial_state, constants = self._normalize_args(
Prefer "standardize" (also used elsewhere) since "normalize" has a specific meaning elsewhere.
Looks good to me! Comments are mostly nitpicking points.
keras/layers/recurrent.py
Outdated
check_list += constants
self.constants_spec = [InputSpec(shape=K.int_shape(constant))
                       for constant in constants]
self._n_constants = len(constants)
In line with naming conventions in this API, this should be `_num_constants`.
keras/layers/recurrent.py
Outdated
is_keras_tensor = hasattr(initial_state[0], '_keras_history')
for tensor in initial_state:
check_list = []
Unclear what is meant by `check_list`. Maybe `input_list` would be better?
keras/layers/recurrent.py
Outdated
raise ValueError('The initial state of an RNN layer cannot be'
                 ' specified with a mix of Keras tensors and'
                 ' non-Keras tensors')
raise ValueError('The initial state and constants of an RNN'
and -> or
Sure, I think I wanted to say that they "jointly" must contain only one kind of tensor (that's what is checked for now)... I don't have any opinion on the error message, but just to verify/understand: shouldn't all inputs be Keras tensors, or not - and why not then include `inputs` in the check as well (not only `constants` and `initial_state`)?
keras/layers/recurrent.py
Outdated
@@ -560,6 +614,48 @@ def call(self, inputs, mask=None, training=None, initial_state=None):
    else:
        return output


def _normalize_args(self, inputs, initial_state, constants):
    """When running a model loaded from file, the input tensors
Nit: first line of docstring (one-line summary) should fit in one line and end with a period.
keras/layers/recurrent.py
Outdated
@@ -560,6 +614,48 @@ def call(self, inputs, mask=None, training=None, initial_state=None):
    else:
        return output


def _normalize_args(self, inputs, initial_state, constants):
    """When running a model loaded from file, the input tensors
    `initial_state` and `constants` can be passed to RNN.__call__ as part
Put ` around code keywords.
keras/layers/recurrent.py
Outdated
constants.

This method makes sure initial_states and constants are separated from
inputs and that the are lists of tensors (or None).
the -> they
keras/layers/recurrent.py
Outdated
initial_state = inputs[1:]
inputs = inputs[0]

def to_list_or_none(x):  # TODO break out?
Unless you want to use it elsewhere, it is fine as it is; you can remove this comment.
keras/layers/recurrent.py
Outdated
@@ -618,6 +714,9 @@ def get_config(self):
    'go_backwards': self.go_backwards,
    'stateful': self.stateful,
    'unroll': self.unroll}
if self._n_constants is not None:
    config['_n_constants'] = self._n_constants
Please call it `num_constants` in the config.
keras/layers/recurrent.py
Outdated
@@ -292,6 +293,14 @@ class RNN(Layer):
    `states` should be a numpy array or list of numpy arrays representing
    the initial state of the RNN layer.

# Note on passing external constants to RNNs
You can pass "external" constants to the cell using the `constants`
keyword argument of RNN.__call__ (as well as RNN.call) method. This
Put ` around code keywords.
@@ -415,19 +422,46 @@ def compute_mask(self, inputs, mask):
    return output_mask

def build(self, input_shape):
    # Note input_shape will be list of shapes of initial states and
Note: `input_shape` will be the list of the input shape followed by the shapes of the initial states and constants, if the latter are passed in `__call__`.
I fixed the comments and added a test case for when both `initial_state` and `constants` are passed. What do you think: should we aim to merge this first separately (a pretty well-defined feature addition), or continue to pull in the next steps (actual attention mechanisms) before merging anything?
We'll merge it first.
LGTM, thanks!
@fchollet I saw that the build is failing on master: https://travis-ci.org/fchollet/keras/jobs/292414692
@andhus it seems that after this was merged into master, CuDNNLSTM broke. After pulling I get an error.
I have the latest nightly TensorFlow.
Unfortunately the cuDNN RNN tests cannot be run on Travis since they require a GPU. We need to remember to run them manually every time we make changes to `recurrent.py`. I will look at the issue.
Yeah, thanks for taking care of this, I did not think about that addition!
Continuation of Recurrent Attention API here: #8296
This PR is the first step to support recurrent attention mechanisms as suggested in #7633.
In short, it adds the possibility to pass `constants` (besides `inputs` and `initial_state`) when calling the RNN layer. The RNN will pass the tensor(s) on to the call method of the RNN cell, which (if constants are passed) is expected to implement the signature `call(inputs, states, constants)`.
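To make the contract concrete, here is a minimal sketch of a cell implementing that signature and of how it is called; the cell, its weights and all dimensions are illustrative only, not part of the PR:

import keras.backend as K
from keras.layers import Layer, Input, RNN
from keras.models import Model

class MinimalConstantsCell(Layer):
    """Toy cell whose state update mixes in an external constant tensor."""

    def __init__(self, units, **kwargs):
        super(MinimalConstantsCell, self).__init__(**kwargs)
        self.units = units
        self.state_size = units

    def build(self, input_shape):
        # When constants are passed, `input_shape` is a list:
        # [step_input_shape, constant_shape, ...]
        input_dim = input_shape[0][-1]
        constant_dim = input_shape[1][-1]
        self.kernel = self.add_weight(shape=(input_dim, self.units),
                                      initializer='uniform', name='kernel')
        self.constant_kernel = self.add_weight(shape=(constant_dim, self.units),
                                               initializer='uniform',
                                               name='constant_kernel')
        self.built = True

    def call(self, inputs, states, constants):
        # `constants` is a list of tensors that stay the same at every timestep.
        prev_h = states[0]
        attended = constants[0]
        h = (K.dot(inputs, self.kernel)
             + K.dot(attended, self.constant_kernel)
             + prev_h)
        return h, [h]

x = Input(shape=(None, 8))   # the input sequence
c = Input(shape=(5,))        # the "external" constant, e.g. the attended
y = RNN(MinimalConstantsCell(4))(x, constants=c)
model = Model([x, c], y)     # the constant is part of the model topology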