
Move batching and field logic from inputter to dsets #1196

Merged (7 commits) on Jan 23, 2019

Conversation

@flauted (Contributor) commented on Jan 22, 2019

  • Move inputters.make_img -> inputters.image_dataset.batch_img (no change other than renaming)
  • Move inputters.make_audio -> inputters.audio_dataset.batch_audio (no change other than renaming)
  • Move inputters._feature_tokenize -> inputters.text_dataset._feature_tokenize (no change)
  • Refactor get_fields to dispatch to each datatype's (new) fields function. The function takes arbitrary keyword arguments, so it is extensible without a case-like structure over the datatypes in get_fields. It returns both fields that belong at the "top level" (src_lengths for audio) and fields that are scoped under the dataset's side (src/tgt); a rough sketch of the dispatch pattern follows this list.
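
For readers skimming the diff, here is a minimal sketch of that dispatch pattern. The function names echo the PR, but the exact signatures and Field arguments below are illustrative assumptions, not the PR's code:

import torch
from torchtext.data import Field

def text_fields(base_name, **kwargs):
    # Hypothetical signature: return (top-level fields, side-scoped fields).
    word = Field(pad_token='<blank>', include_lengths=(base_name == 'src'))
    return [], [(base_name, word)]

def audio_fields(base_name, **kwargs):
    # Audio additionally needs a top-level src_lengths field (see the discussion below).
    length = Field(use_vocab=False, dtype=torch.long, sequential=False)
    feats = Field(use_vocab=False, dtype=torch.float, sequential=False)
    return [('src_lengths', length)], [(base_name, feats)]

FIELD_GETTERS = {'text': text_fields, 'audio': audio_fields}

def get_fields(src_data_type, **kwargs):
    fields = {}
    toplevel, fields['src'] = FIELD_GETTERS[src_data_type]('src', **kwargs)
    for name, field in toplevel:
        fields[name] = [(name, field)]
    _, fields['tgt'] = text_fields('tgt', **kwargs)  # tgt is always text
    return fields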

Originally part of #1194

@bpopeters (Contributor):

I'm in favor of moving things out of inputter.py because it's not at all clear what an "inputter" is. This seems very promising. The refactor of get_fields is much needed.

Could you explain the distinction between top-level and other fields? It seems to necessitate the text_fields and image_fields functions returning empty lists alongside their real output, and I wonder if there is a way to handle this without that.

@flauted (Contributor, Author) commented on Jan 22, 2019

It's there because of this bit in the current master get_fields:

if src_data_type == 'audio':
    # only audio has src_lengths
    length = Field(use_vocab=False, dtype=torch.long, sequential=False)
    fields["src_lengths"] = [("src_lengths", length)]

For whatever reason, that "src_lengths" field is scoped at the same level as "src" and "tgt" in the fields dictionary. Maybe that could be changed; I don't know enough about what the audio datatype does with it.

@bpopeters (Contributor):

I think it is used for sequence packing, the same as the lengths used for the text source field. It would be nice if the lengths could be created the same way for audio as for text, but I don't think that can be done unless the audio field has sequential=True, and it's not clear to me what will happen when combining sequential=True with use_vocab=False.
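
For context, those lengths end up feeding standard PyTorch sequence packing, along these lines (a generic illustration, not code from this repository):

import torch
from torch.nn.utils.rnn import pack_padded_sequence

emb = torch.randn(50, 8, 512)                  # padded (seq_len, batch, dim)
lengths, _ = torch.randint(10, 51, (8,)).sort(descending=True)
packed = pack_padded_sequence(emb, lengths)    # lets the RNN skip the padding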

@vince62s (Member):

A while ago I tried to get rid of src_lengths (the number of frames) for audio, relying on padding instead.
But it is not so easy; IIRC there is also the pooling step that conflicts a bit.
Anyway, if you guys think it's ready to merge, let me know.

@flauted (Contributor, Author) commented on Jan 22, 2019

I blew away src_lengths. @vince62s, you're right that the padding in torchtext's Field conflicts. Using the torchtext Field and getting the lengths out of it is seemingly impossible: you'll always have some attribute conflicting and trying to treat the audio like a sequence of text. My solution was writing an AudioSeqField that inherits from Field to override the conflicting functions (and set sensible defaults), which, in my opinion, is a lot cleaner than using src_lengths.

The batch_audio functionality fit nicely into the overridden pad function, so I got rid of batch_audio entirely.

@bpopeters I guess you can be the judge of whether it's actually cleaner to use a Field subclass rather than that toplevel_fields solution. I can always revert.
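
For reference, roughly what such a subclass can look like; the padding layout and the (tensor, lengths) return value below are my assumptions, not the PR's exact implementation:

import torch
from torchtext.data import Field

class AudioSeqField(Field):
    """Pads variable-length spectrograms and keeps track of their lengths."""

    def __init__(self, **kwargs):
        kwargs.setdefault("use_vocab", False)
        kwargs.setdefault("sequential", False)
        kwargs.setdefault("include_lengths", True)
        super(AudioSeqField, self).__init__(**kwargs)

    def pad(self, minibatch):
        # Each example is a (nfft, n_frames) float tensor.
        lengths = [x.size(1) for x in minibatch]
        nfft = minibatch[0].size(0)
        padded = torch.zeros(len(minibatch), 1, nfft, max(lengths))
        for i, (feat, length) in enumerate(zip(minibatch, lengths)):
            padded[i, :, :, :length] = feat
        return padded, lengths

    def numericalize(self, arr, device=None):
        # No vocab lookup; just move everything to the right device.
        padded, lengths = arr
        lengths = torch.tensor(lengths, dtype=torch.long, device=device)
        return padded.to(device), lengths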


Reviewed change (old line vs. the new lines adding the dim check):

-    if isinstance(batch.__dict__[side], tuple) or side == 'tgt':
+    if not data.dim() == 4 and (
+            isinstance(batch.__dict__[side], tuple) or side == 'tgt'):
@vince62s (Member):

Not sure I follow this data.dim() == 4 test.

@flauted (Contributor, Author):

Oh yeah, that's not a good change. Sorry.

It used to be if data_type == 'text':, but #1184 axed the data_type argument and instead checks whether batch.src is a tuple, since before only text was a tuple. Now audio is a tuple too.

I'd suggest going back to passing data_type into make_features now that there's an actual ambiguity. Is that okay?

@vince62s (Member):

I don't think you have access to data_type in all the places where you call make_features

@flauted (Contributor, Author):

Actually I'm looking at the usages and that's not okay. I wonder if there's a way to attach the data type to the batch.

@bpopeters (Contributor):

In the long run, by the time it's batched it shouldn't be necessary to know the data type. It's just a batch of data that the model should be able to handle. It only comes up right now because our way of handling feature embeddings (a feature almost no one uses) is super hacky.

@vince62s (Member):

Well, what would be a better way? When I used the Lua version, feature embeddings were very convenient and reasonably efficient. In OpenNMT-py today, just the fact that we don't support target features makes this useless, but it would be a good thing to have, no?

@bpopeters (Contributor):

Feature embeddings are undoubtedly a good thing to have, and I have used them in my own research. The question is just when, where, and how they get numericalized. I think the best way would be some kind of multilevel field that takes words plus arbitrarily many features as input and, when numericalized, produces the sort of stacked tensor that the embeddings module expects. The NestedField in torchtext almost, but not quite, fits the bill.

It would save the trainer and translator from having to reason about the types of the source and target data.
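
To make the idea concrete, here is one possible shape for it; MultiLevelField and its process signature are hypothetical, not an existing torchtext or OpenNMT-py class:

import torch
from torchtext.data import Field

class MultiLevelField(object):
    """Hypothetical: one sub-Field per annotation level (words, feat_1, ...)."""

    def __init__(self, word_field, feat_fields):
        self.fields = [word_field] + list(feat_fields)

    def process(self, batch, device=None):
        # batch: list of examples; each example is a tuple of equal-length
        # token sequences, one per level: (words, feats_1, ..., feats_n).
        levels = []
        for field, level_batch in zip(self.fields, zip(*batch)):
            padded = field.pad(level_batch)
            levels.append(field.numericalize(padded, device=device))
        # Stack into (seq_len, batch, n_levels), as the embeddings expect.
        return torch.stack(levels, dim=2)

# e.g. MultiLevelField(Field(pad_token='<blank>'), [Field(pad_token='<blank>')])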

@vince62s (Member):

Oh, OK, I misread the "(almost no one uses)" comment.
So in the meantime, are we OK to set a data_type attribute on batch?

@bpopeters (Contributor):

Yes, provided we understand that it's a temporary solution until such a time as the data processing pipeline is capable of numericalizing its own input.
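
In code, the stopgap could be as small as a wrapper around the batch iterator; where exactly the attribute is set is my assumption, not something decided in this thread:

def tag_batches(torchtext_iter, data_type):
    """Yield batches tagged with their data type (temporary workaround)."""
    for batch in torchtext_iter:
        batch.data_type = data_type
        yield batch

make_features could then test batch.data_type == 'text' instead of the ambiguous data.dim() == 4 check.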

@flauted (Contributor, Author):

I can probably take on building that multilevel field. @bpopeters would you please make an issue with that description of what's desired so we have a place to discuss design if I have questions?

@flauted (Contributor, Author) commented on Jan 22, 2019

@bpopeters @vince62s I think this is ready to merge.

@vince62s (Member):

OK, merging.
