Input Data Size placeholder missing when using reduce and repeat #1102
When trying to use `reduce` together with `repeat`, if the reduce happens over the repeat out time dim, RETURNN crashes with the following error: [...]

The (reduced) network looks like this: [...]

I uploaded a full config with data here:
https://gist.github.com/Atticus1806/3795c7193b1f022b5a1b107f4c1f28c9

Removing the `x` here and marking `vae_speaker_embedding` as loss does not produce this error, so it seems the reduce layer is interfering.

Comments
I don't understand why you need/use [...]. I also don't understand why the [...].

This is a leftover from the original full config. In general it could be removed, but with the way the data is set up right now, it is required.

But for reproducing the test case, is this needed? Have you prepared a standalone test case?

Right now, yes: it is needed to make the test fail "successfully".
Test on RETURNN-common side:

```python
def test_reduce_repeat_1102():
    # https://github.com/rwth-i6/returnn/issues/1102

    class _NARTTSModel(nn.Module):
        # noinspection PyShadowingNames
        def __call__(
            self,
            emb: nn.Tensor,
            durations: nn.Tensor,
            target_speech: nn.Tensor,
            time_dim: nn.Dim,
            speech_time: nn.Dim,
        ) -> nn.Tensor:
            x = nn.reduce(target_speech, mode="mean", axis=speech_time)
            x.mark_as_loss()
            rep, rep_dim = nn.repeat(emb, axis=time_dim, repetitions=durations, out_dim=speech_time)
            return rep

    nn.reset_default_root_name_ctx()
    time_dim = nn.SpatialDim("time")
    speech_time = nn.SpatialDim("speech")
    emb = nn.get_extern_data(nn.Data("emb", dim_tags=[nn.batch_dim, time_dim, nn.FeatureDim("F", 1)]))
    durations = nn.get_extern_data(
        nn.Data("durations", dim_tags=[nn.batch_dim, time_dim], dtype="int32"))
    target_speech = nn.get_extern_data(
        nn.Data("target_speech", dim_tags=[nn.batch_dim, speech_time, nn.FeatureDim("speech-feat", 3)]))

    net = _NARTTSModel()
    out = net(emb, durations, target_speech, time_dim, speech_time)
    out.mark_as_default_output()

    config = nn.get_returnn_config().get_complete_py_code_str(net)

    def _make_feed_dict(extern_data):
        d = extern_data.data
        return {
            d["emb"].placeholder: [[[1.], [2.], [0.]]],
            d["emb"].size_placeholder[0]: [3],
            d["durations"].placeholder: [[1, 2, 1]],  # durations sum to 4, matching target_speech's length below
            d["target_speech"].placeholder: [[[1., 2., 3.], [4., 5., 6.], [1., 2., 3.], [1., 2., 3.]]],
            d["target_speech"].size_placeholder[0]: [4],
        }

    dummy_run_net_single_custom(config, make_feed_dict=_make_feed_dict, eval_flag=True)
```
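Side note on the hand-written feed dict: it is only consistent because the durations sum to the `target_speech` length (the `out_dim=speech_time` in `nn.repeat` ties the two together). A minimal sketch of that invariant, using plain numpy and the values from the test above:

```python
import numpy as np

durations = np.array([[1, 2, 1]])   # per-position repetition counts, batch of 1
target_speech_len = np.array([4])   # size fed for the speech time dim

# nn.repeat produces sum(durations) frames per sequence; out_dim=speech_time
# identifies that with target_speech's time dim, so the lengths must agree:
assert (durations.sum(axis=1) == target_speech_len).all()
```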
Ah, now I see. When producing the test, I was not sure how to manually change the feed dict, but this makes a lot of sense.
Test on RETURNN side:

```python
def test_reduce_repeat_1102():
  # https://github.com/rwth-i6/returnn/issues/1102
  from returnn.tf.util.data import batch_dim, SpatialDim, FeatureDim
  time_dim = SpatialDim("time")
  F_dim = FeatureDim("F", 1)
  speech_dim = SpatialDim("speech")
  speech_feat_dim = FeatureDim("speech-feat", 3)
  config = Config(dict(extern_data={
    "emb": {
      "dim_tags": (batch_dim, time_dim, F_dim),
      "dtype": "float32",
      "available_for_inference": True,
    },
    "durations": {
      "dim_tags": (batch_dim, time_dim),
      "dtype": "int32",
      "available_for_inference": True,
    },
    "target_speech": {
      "dim_tags": (batch_dim, speech_dim, speech_feat_dim),
      "dtype": "float32",
      "available_for_inference": True,
    },
  }))
  net_dict = {
    "nartts_model_reduce": {
      "class": "copy",
      "from": "reduce",
      "loss": "as_is",
      "out_shape": {batch_dim, speech_feat_dim},
    },
    "output": {
      "class": "copy",
      "from": "repeat",
      "out_shape": {batch_dim, F_dim, speech_dim},
    },
    "reduce": {
      "class": "reduce",
      "from": "data:target_speech",
      "mode": "mean",
      "axis": speech_dim,
      "out_shape": {batch_dim, speech_feat_dim},
    },
    "repeat": {
      "class": "repeat",
      "from": "data:emb",
      "repetitions": "data:durations",
      "axis": time_dim,
      "out_dim": speech_dim,
      "out_shape": {batch_dim, F_dim, speech_dim},
    },
  }
  with make_scope() as session:
    net = TFNetwork(config=config, eval_flag=True)
    net.construct_from_dict(net_dict)
    d = net.extern_data.data
    feed_dict = {
      d["emb"].placeholder: [[[1.], [2.], [0.]]],
      d["emb"].size_placeholder[0]: [3],
      d["durations"].placeholder: [[1, 2, 1]],
      d["target_speech"].placeholder: [[[1., 2., 3.], [4., 5., 6.], [1., 2., 3.], [1., 2., 3.]]],
      d["target_speech"].size_placeholder[0]: [4],
    }
    fetches = net.get_fetches_dict()
    session.run(fetches, feed_dict=feed_dict)
```
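To make concrete what the `repeat` layer should compute here: with the feed values above, each `emb` frame is repeated according to `durations`, giving a speech time length of 4. A quick numpy sketch (my own illustration, not RETURNN code):

```python
import numpy as np

emb = np.array([[[1.], [2.], [0.]]])  # [B, time, F]
durations = np.array([[1, 2, 1]])     # [B, time]

# Repeat each time frame of the single batch entry by its duration:
rep = np.repeat(emb[0], durations[0], axis=0)[None]  # [B, speech_time, F]
print(rep.tolist())  # [[[1.0], [2.0], [2.0], [0.0]]], i.e. speech_time length 4
```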
My current hypothesis: the repeat layer [...]. So, probably the reduce layer was created first and used the original [...].
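For context on the identification mechanism this hypothesis refers to: in RETURNN, two dim tags can be declared identical after the fact via `Dim.declare_same_as`. A minimal sketch, assuming RETURNN and TensorFlow are installed and using the old-style `returnn.tf.util.data` API from this thread (the dim names are just stand-ins):

```python
from returnn.tf.util.data import SpatialDim

a = SpatialDim("speech")        # e.g. the user-side speech_time dim
b = SpatialDim("repeated:emb")  # e.g. the dim the repeat layer creates internally

assert a != b         # distinct tags before identification
b.declare_same_as(a)
assert a == b         # afterwards they are treated as the same dim
```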
Note that this is also somewhat ambiguous (wrong? not well defined?) in the original code. In [...]:

```python
rep, rep_dim = nn.repeat(emb, axis=time_dim, repetitions=duration_int)
```

Basically you almost never should set `out_dim` [...]. Then:

```python
rep_dim.declare_same_as(speech_time)
```

But, as said, I'm not really sure if this code would actually behave just the same as your current code.
So, the question is, what should actually happen in this case? I.e. some layer which produces a new dim (like [...]) [...].

Actually, I think the first case is what we already have mostly, so let's keep it that way. But that just moves the question over to [...]. Looking at [...]:

```python
if isinstance(repetitions, int):
  out_dim_ = tag * repetitions
else:
  out_dim_ = Dim(description="repeated:%s" % name, kind=tag.kind, derived_from_tag=tag, auto_generated=True)
if out_dim:
  out_dim_.declare_same_as(out_dim)
```

Maybe the order of the ...

[...]

```python
...
out_spatial_dims_ = output.dim_tags[num_batch_dims + 1:]
...
if out_spatial_dims:
  assert len(out_spatial_dims_) == len(out_spatial_dims)
  for i, (out_spatial_dim_, out_spatial_dim) in enumerate(zip(out_spatial_dims_, out_spatial_dims)):
    out_spatial_dim_.declare_same_as(out_spatial_dim)
```

So again it is correct.
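To spell out the order question raised above: `a.declare_same_as(b)` keeps `b` as the base tag, so swapping the two arguments changes which dim's already-attached state wins. A toy sketch under the same assumptions as the earlier one (my own illustration, not the actual layer code):

```python
from returnn.tf.util.data import SpatialDim

out_dim = SpatialDim("speech")         # user-provided dim (maybe already used by reduce)
out_dim_ = SpatialDim("repeated:emb")  # dim freshly created by the repeat layer

# Order as in the RepeatLayer snippet above: the layer-created dim defers to the user's dim.
out_dim_.declare_same_as(out_dim)
assert out_dim_.get_same_base() is out_dim

# With the swapped order, out_dim.declare_same_as(out_dim_), the layer-created
# dim would become the base instead, so its computed size would take precedence.
```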
Ongoing PR is #1104.
It should be fixed now. |