Added the wrapper Bidirectional for RNN layers #1790
base: master

Conversation
```julia
# Concatenate the forward and reversed backward weights
function (m::Bidirectional)(x::Union{AbstractVecOrMat{T},OneHotArray}) where {T}
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
```
Suggested change:
```julia
return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=3)); dims=3))
```
Sorry, just found this. When applying Flux RNNs on dense sequences, the temporal dim is actually the last one. See Flux.jl/src/layers/recurrent.jl line 83 in dd6b70e:
```julia
function (m::Recur)(x::AbstractArray{T, 3}) where T
```
and Flux.jl/src/layers/recurrent.jl line 27 in dd6b70e:
> Folding over a 3d Array of dimensions `(features, batch, time)` is also supported:
I'm quite confused now, because I was expecting the variable x to be the result of a OneHotMatrix operation on a sentence, so it would be a matrix. In my experiments, I was using an array of one-hot encoded sentences (padded to a fixed size) like this:
```julia
X_onehot = [OneHotMatrix(x, vocab_size) for x in X]
y_onehot = [OneHotMatrix(x, num_tags) for x in y]
```
where X is an array of padded sentences and y is the corresponding labels of each word in the sentence (I am experimenting with Named Entity Recognition models). So the input data would be a matrix in (batch, (time, features)) format, not in (features, batch, time) format.

I am not sure how to proceed.
https://fluxml.ai/Flux.jl/stable/models/recurrence/#Working-with-sequences has most of the nitty-gritty details about working with batched sequences for RNNs. In short, the only supported formats are ((features, batch), time) and (features, batch, time). Unlike Python frameworks, Flux puts the batch dim last for all layers because of column-major layout. RNNs just add another time dim after that.

Since this definition of Bidirectional takes a contiguous array instead of a vector of arrays, m.forward() and m.backward() dispatch to (m::Recur)(x::AbstractArray{T, 3}). To support both, you'd need something like the following (note: untested!):

- (m::Bidirectional)(x::AbstractArray{T, 3}) where T for (features, batch, time)
- (m::Bidirectional)(x::Vector{<:AbstractVecOrMat{T}}) where T for ((features, batch), time)
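To make that concrete, here is a hedged sketch of what those two methods might look like, assuming the forward/backward fields shown in the diff above (untested and purely illustrative, not part of the PR):
```julia
# (features, batch, time): dense 3-d input, reverse along the time (3rd) dimension
function (m::Bidirectional)(x::AbstractArray{T, 3}) where T
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=3)); dims=3))
end

# ((features, batch), time): a vector with one array per timestep, reversed as a whole
function (m::Bidirectional)(xs::AbstractVector{<:AbstractVecOrMat{T}}) where T
    fwd = [m.forward(x) for x in xs]
    bwd = reverse([m.backward(x) for x in reverse(xs)])
    return vcat.(fwd, bwd)
end
```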
But the RNN layers do not literally accept a vector of matrices. At the link this is iterated through by hand. Should method 2 of Bidirectional(fwd, rev)(x) handle that?
```julia
julia> LSTM(3,5)(randn(Float32, 3,)) |> size
(5,)

julia> LSTM(3,5)(randn(Float32, 3,7)) |> size
(5, 7)

julia> LSTM(3,5)(randn(Float32, 3,7,11)) |> size
(5, 7, 11)

julia> LSTM(3,5)([randn(Float32, 3,7) for _ in 1:11]) |> size
ERROR: MethodError

julia> function (b::Bidirectional)(xs::AbstractVector{<:AbstractVecOrMat})
           top = [b.forward(x) for x in xs]
           bot = reverse([b.backward(x) for x in reverse(xs)])  # struct field is `backward`
           vcat.(top, bot)
       end
```
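For reference, a hypothetical usage of the vector-of-timesteps method sketched above (assuming it is defined and using the plain two-argument Bidirectional(forward, backward) constructor from the diff; the sizes are made up):
```julia
bd = Bidirectional(LSTM(3, 5), LSTM(3, 5))   # forward and backward LSTMs
xs = [randn(Float32, 3, 7) for _ in 1:11]    # 11 timesteps, batch of 7
ys = bd(xs)                                  # Vector of 11 matrices, each of size (10, 7)
```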
> Should method 2 of Bidirectional(fwd, rev)(x) handle that?

Yes, exactly.
And applied to randn(Float32, 3,7), should it fail, or still reverse in the 3rd dimension?
The signatures specifically prohibit that, so it should fail. Did you have a specific edge case in mind that isn't caught by the above?
No, I just wonder about the mismatch between what LSTM accepts and what this accepts. Prohibiting things which don't make sense is good; you want to find out early.
Right, the RNN layers can work with single timesteps but Bidirectional can't, as it needs to see the whole sequence up-front in order to reverse it. If anything the former should be brought in line with the latter, but that's a conversation for another issue (#1678 probably).
Thank you for the great conversation here and the suggestions. I just tested, and Bidirectional, as it is, is able to process randn(Float32, 3,7):
```julia
Bidirectional(LSTM, 3, 5)(randn(Float32, 3, 7)) |> size
(10, 7)
```
But about the format, I am still confused about how to preprocess the text data so that it ends up in (seq_length, (features, samples)) format. It seems counterintuitive to me. I usually follow: read the data -> split sentences -> split words -> pad -> one-hot. So my data would be an array with N sentences, where every sentence is described as a one-hot matrix of its words. In this way, I end up with the (samples, (features, seq_length)) format. How should I preprocess the text data so that I end up with (seq_length, (features, samples))?

Also, by checking these formats I discovered that I should probably use dims=2 in my formulation of the reverse call (not the default, which reverses all dimensions; it should reverse only the time dimension, which in my case is the second dimension of the OneHotMatrix).
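As a quick illustration of that dims=2 reversal on a single (features, time) matrix (a minimal sketch with made-up numbers, not from the PR):
```julia
x = Float32[1 2 3; 4 5 6]   # pretend (features, time) = (2, 3)
reverse(x; dims=2)          # flips only the time axis: columns 3, 2, 1
reverse(x)                  # default reverses every dimension (Julia >= 1.6)
```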
@dcecchini do you have local (behavioural) tests that weren't pushed? I only see changes in
I don't think so, I did all the changes on
…w method to the `show.jl` file. Modified the constructor to accept `a...` and `ka...` instead of only the `in` and `out`.
Testing-wise, having something in https://github.com/FluxML/Flux.jl/blob/6168692ddb2680ebe02f461bf0c59c2b6cb0ff1e/test/layers/recurrent.jl that checks the logic (need not be complex) would be good. You can see how this is done for other composite layers such as
```julia
"""
    Bidirectional{A,B}
```
Once we settle on the constructor, this line should show the various constructors, not the type definition.
To be clear, accepting that input is a bug, not a feature :P
Workflow-wise, the key is to put the batch dim in between the time and feature ones. There are a few ways to do this, but one of the easiest would be Flux.onehotbatch(transpose(reduce(hcat, batch_of_padded_sentences)), vocab). See the Flux.onehotbatch docs (https://fluxml.ai/Flux.jl/stable/data/onehot/#Flux.onehotbatch) for more details. We're actively thinking about how to make working with RNNs more ergonomic (the old interface is not terribly so), but in the meantime getting your data into (samples, seq_length) or (samples, (seq_length)) first and then doing the one-hot transform is the easiest way to go about things.
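A hedged sketch of that workflow with made-up sentence data (X_pad, vocab_size and the token ids are illustrative names and values, not from the PR):
```julia
using Flux

vocab_size = 5
# three padded sentences of four token ids each (illustrative data)
X_pad = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 1, 2, 2]]

sents = transpose(reduce(hcat, X_pad))      # (samples, time) = (3, 4)
X = Flux.onehotbatch(sents, 1:vocab_size)   # one-hot dim first: (features, batch, time)
size(X)                                     # (5, 3, 4)
```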
Yes! Adding the batch between time and features is not difficult; the problem is having the batch as the last dimension! I was able to transform it with [[X_onehot[i][:, j] for i in 1:length(X_onehot)] for j in 1:size(X_onehot[1])[2]], but after checking your example I was able to do this:
```julia
aux = reduce(hcat, X_pad);  # size(aux) = (time, samples)
Xs = [Flux.onehotbatch(aux[i, :], 1:vocab_size) for i in 1:size(aux)[1]];
size(Xs)     # (time,)
size(Xs[1])  # (features, samples)
```
So, in the end, I had to use the one-hot encoding in a different way: instead of using it on a sentence, use it on arrays representing first words, second words, etc.

Edit: After experimenting with this, I was unable to make it work with
So I really would prefer the "bugged" version, as it allows sending the whole sentence in a batch, and it works fine passing it as
Does anyone have another solution?
You'd have to pass each timestep in separately to the embedding layer, because it takes an input of size
Same story here. Alternatively, you could use dense arrays all the way through. Given a set of inputs of shape
Just to clarify, using the time dimension as the batch dim means that each token in the sentence will be processed in parallel... which is probably not what you want with an RNN :P
Any updates for this PR?
Questions about this implementation of Bidirectional RNN: since a bidirectional RNN requires the full sequence (forward hidden state calculation ➡️ + backward hidden state calculation ⬅️), for a single time step (1/seq_length) of input we cannot calculate the hidden state of a bidirectional RNN, right? But if we initialize the model in this PR:
```julia
julia> model = Bidirectional(LSTM, 3, 5)
Bidirectional(LSTMCell(3 => 5), LSTMCell(3 => 5))

julia> x = rand(Float32, 3)
3-element Vector{Float32}:
 0.40451288
 0.40546536
 0.7272219

julia> model(x)
10-element Vector{Float32}:
 -0.15061164
 -0.105684996
 -0.08399847
  0.113812014
 -0.03911737
  0.061232302
 -0.011352453
 -0.0018385417
 -0.06642
  0.006111855
```
It outputs a single-step result. Is this behavior expected? I think the correct behavior is (for batch size 1):
```julia
# input size(all_time_step) = (sequence_len * features)
julia> model(all_time_step)
# output size = (sequence_len * outsize*2)
```
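For concreteness, a hedged sketch of the whole-sequence shape check being described (this assumes the vector-of-timesteps method proposed further down in this thread exists; names and sizes are illustrative, not from the PR):
```julia
using Flux

features, hidden, seq_len = 3, 5, 4
brnn = Bidirectional(LSTM, features, hidden)

# one feature vector per timestep, passed in together
all_time_step = [rand(Float32, features) for _ in 1:seq_len]
out = brnn(all_time_step)

length(out)     # expected: seq_len     (one output per timestep)
length(out[1])  # expected: 2 * hidden  (forward and backward concatenated)
```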
It seems that the original author has given up on implementing it, and I want to try to continue his work (this is related to my graduation,
```julia
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
end

@functor Bidirectional
```
Flux.@layer here?
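If the PR were rebased onto a recent Flux (≥ 0.14), the declaration would presumably switch from @functor to the @layer macro; a minimal sketch, assuming that Flux version:
```julia
# replaces `@functor Bidirectional` on Flux >= 0.14
Flux.@layer Bidirectional
```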
I think BRNN should take all the time steps, so we can't use it like the RNN that we have implemented. This usage is valid for RNN:
```julia
# 2 => feature, 3 => sequence_len, 4 => batch_size
x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
rnn = RNN(2=>5)
# out_dim = 3(sequence_len) * 5(hidden) * 4(batch_size)
rnn_out = [rnn(xi) for xi in x]
```
But the same usage is invalid for BRNN:
```julia
# 2 => feature, 3 => sequence_len, 4 => batch_size
x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
brnn = Bidirectional(RNN,2,5)
# out_dim = 3(sequence_len) * 10(hidden) * 4(batch_size)
brnn_out = [brnn(xi) for xi in x]
```
Because in
```julia
function (m::Bidirectional)(x::Union{AbstractVecOrMat{T},Flux.OneHotArray}) where {T}
    # size(x) = (2,4) (feature * batch_size)
    return vcat(m.forward(x), reverse(m.backward(reverse(x; dims=1)); dims=1))
end
```
the size of
So for BRNN input, I think the correct dimensions are as follows. For 2d-matrix input or vector-of-vectors input, which means:
```julia
# seq_len = 3
# feature = 2
# vector of vector
x = [rand(Float32,2) for i = 1:3]
# or
x = rand(Float32,3,2)
# output all hidden states, with double size
# size = sequence_len * 2hidden_size
brnn(x)
```
For vector of 2d-matrix input, which means:
```julia
# seq_len = 3
# feature = 2
# batch_size = 4
x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
# output all hidden states, with double size
# size = sequence_len * 2hidden_size * batch_size
brnn_out = brnn(x)
```
I think the correct implementation is:
```julia
function (m::Bidirectional)(x::Union{AbstractVecOrMat{T},Flux.OneHotArray}) where {T}
    vec_for = [m.forward(xi) for xi in x]
    vec_back = reverse([m.backward(xi) for xi in reverse(x)])
    return vcat.(vec_for, vec_back)
end
```
For no-batch input:
```julia
x = [rand(Float32,2) for i = 1:3]
# size = 3(seq_len) * 10(double hidden_size)
brnn(x)
# size = 3(seq_len) * 5(hidden_size)
[rnn(xi) for xi in x]
```
For input with batches:
```julia
x = [[1 2 1 2; 3 4 3 4] for i = 1:3]
# size = 3(seq_len) * 10(double hidden_size) * 4(batch_size)
brnn(x)
# size = 3(seq_len) * 5(hidden_size) * 4(batch_size)
[rnn(xi) for xi in x]
```
The output sizes of RNN and BRNN are the same.
Hi @NeroBlackstone, at that time I could not finish the implementation due to the difficulty of getting something to work with the designed structure for RNN cells, as you can see in the history above. I would be glad to work on this, but I have little time to spare.
PR Checklist

Description

The implementation was based on the implementation of BiLSTM_CNN_CRF_Model present in TextModels.jl but has a few differences:
- Bidirectional usage is closer to Keras, as it allows the creation of, for example, Bidirectional(Flux.LSTM, in, out) inside Chain declarations.
- dims=1 parameter on the reverse function, which was needed in my experiments, but a more generic implementation may be possible (I was not able to use flip, probably for the same reason).

I also updated the Base.show and _big_show functions so that it is printed in the same style as Flux, giving the number of trainable parameters. The implementation added for _big_show could be added to src/show.jl later.

This is my first PR here. I read the information in CONTRIBUTING.md, but please let me know if it is not following the correct standards. I really enjoy Julia even though I am still learning. Hope I can contribute to this great project, Flux.