
Move batching and field logic from inputter to dsets #1196

Merged (7 commits) on Jan 23, 2019

Conversation

@flauted (Contributor) commented on Jan 22, 2019

  • Move inputters.make_img -> inputters.image_dataset.batch_img (no change other than renaming)
  • Move inputters.make_audio -> inputters.audio_dataset.batch_audio (no change other than renaming)
  • Move inputters._feature_tokenize -> inputters.text_dataset._feature_tokenize (no change)
  • Refactor get_fields to dispatch to each datatype's (new) fields function. The function takes arbitrary keyword arguments, so it is extensible without a case-like structure over the datatypes in get_fields. It returns both fields that belong at the "top level" (src_lengths for audio) and fields that are scoped under the dataset's side (src/tgt); a rough sketch of the dispatch pattern follows this list.
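
For readers skimming the diff, here is a minimal sketch of that dispatch pattern. The function names echo the PR, but the exact signatures and Field arguments below are illustrative assumptions, not the PR's code:

import torch
from torchtext.data import Field

def text_fields(base_name, **kwargs):
    # Hypothetical signature: return (top-level fields, side-scoped fields).
    word = Field(pad_token='<blank>', include_lengths=(base_name == 'src'))
    return [], [(base_name, word)]

def audio_fields(base_name, **kwargs):
    # Audio additionally needs a top-level src_lengths field (see the discussion below).
    length = Field(use_vocab=False, dtype=torch.long, sequential=False)
    feats = Field(use_vocab=False, dtype=torch.float, sequential=False)
    return [('src_lengths', length)], [(base_name, feats)]

FIELD_GETTERS = {'text': text_fields, 'audio': audio_fields}

def get_fields(src_data_type, **kwargs):
    fields = {}
    toplevel, fields['src'] = FIELD_GETTERS[src_data_type]('src', **kwargs)
    for name, field in toplevel:
        fields[name] = [(name, field)]
    _, fields['tgt'] = text_fields('tgt', **kwargs)  # tgt is always text
    return fields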

Originally part of #1194

@bpopeters (Contributor):

I'm in favor of moving things out of inputter.py because it's not at all clear what an "inputter" is. This seems very promising. The refactor of get_fields is much needed.

Could you explain the distinction between top-level and other fields? It seems to necessitate the text_fields and image_fields functions returning empty lists alongside their real output, and I wonder if there is a way to handle this without that.

@flauted (Contributor, Author) commented on Jan 22, 2019

It's there because of this bit in the current master get_fields:

if src_data_type == 'audio':
    # only audio has src_lengths
    length = Field(use_vocab=False, dtype=torch.long, sequential=False)
    fields["src_lengths"] = [("src_lengths", length)]

For whatever reason, that "src_lengths" field is scoped at the same level as "src" and "tgt" in the fields dictionary. Maybe that could be changed; I don't know enough about what the audio datatype does with it.

@bpopeters (Contributor):

I think it is used for sequence packing, the same as the lengths used for the text source field. It would be nice if the lengths could be created the same way for audio as for text, but I don't think that can be done unless the audio field has sequential=True, and it's not clear to me what will happen when combining sequential=True with use_vocab=False.
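
For context, those lengths end up feeding standard PyTorch sequence packing, along these lines (a generic illustration, not code from this repository):

import torch
from torch.nn.utils.rnn import pack_padded_sequence

emb = torch.randn(50, 8, 512)                  # padded (seq_len, batch, dim)
lengths, _ = torch.randint(10, 51, (8,)).sort(descending=True)
packed = pack_padded_sequence(emb, lengths)    # lets the RNN skip the padding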

@vince62s (Member):

A while ago I tried to get rid of src_lengths (the number of frames) for audio, relying on padding instead.
But it is not so easy; IIRC there is also the pooling step that conflicts a bit.
Anyway, if you guys think it's ready to merge, let me know.

@flauted (Contributor, Author) commented on Jan 22, 2019

I blew away src_lengths. @vince62s, you're right that the padding in torchtext's Field conflicts. Using the torchtext Field and getting the lengths out of it is seemingly impossible: you'll always have some attribute conflicting and trying to treat the audio like a sequence of text. My solution was writing an AudioSeqField that inherits from Field to override the conflicting functions (and set sensible defaults), which, in my opinion, is a lot cleaner than using src_lengths.

The batch_audio functionality fit nicely into the overridden pad function, so I got rid of batch_audio entirely.

@bpopeters I guess you can be the judge of whether it's actually cleaner to use a Field subclass rather than that toplevel_fields solution. I can always revert.
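
For reference, roughly what such a subclass can look like; the padding layout and the (tensor, lengths) return value below are my assumptions, not the PR's exact implementation:

import torch
from torchtext.data import Field

class AudioSeqField(Field):
    """Pads variable-length spectrograms and keeps track of their lengths."""

    def __init__(self, **kwargs):
        kwargs.setdefault("use_vocab", False)
        kwargs.setdefault("sequential", False)
        kwargs.setdefault("include_lengths", True)
        super(AudioSeqField, self).__init__(**kwargs)

    def pad(self, minibatch):
        # Each example is a (nfft, n_frames) float tensor.
        lengths = [x.size(1) for x in minibatch]
        nfft = minibatch[0].size(0)
        padded = torch.zeros(len(minibatch), 1, nfft, max(lengths))
        for i, (feat, length) in enumerate(zip(minibatch, lengths)):
            padded[i, :, :, :length] = feat
        return padded, lengths

    def numericalize(self, arr, device=None):
        # No vocab lookup; just move everything to the right device.
        padded, lengths = arr
        lengths = torch.tensor(lengths, dtype=torch.long, device=device)
        return padded.to(device), lengths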


Reviewed change (old line vs. the new lines adding the dim check):

-    if isinstance(batch.__dict__[side], tuple) or side == 'tgt':
+    if not data.dim() == 4 and (
+            isinstance(batch.__dict__[side], tuple) or side == 'tgt'):
@vince62s (Member):

Not sure I follow this data.dim() == 4 test.

@flauted (Contributor, Author):

Oh yeah, that's not a good change. Sorry.

It used to be if data_type == 'text':, but #1184 axed the data_type argument and instead checks whether batch.src is a tuple, since before only text was a tuple. Now audio is a tuple too.

I'd suggest going back to passing data_type into make_features now that there's an actual ambiguity. Is that okay?

@vince62s (Member):

I don't think you have access to data_type in all the places where you call make_features

@flauted (Contributor, Author):

Actually I'm looking at the usages and that's not okay. I wonder if there's a way to attach the data type to the batch.

@bpopeters (Contributor):

In the long run, by the time it's batched it shouldn't be necessary to know the data type. It's just a batch of data that the model should be able to handle. It only comes up right now because our way of handling feature embeddings (a feature almost no one uses) is super hacky.

@vince62s (Member):

Well, what would be a better way? When I used the Lua version, feature embeddings were very convenient and reasonably efficient. In OpenNMT-py today, just the fact that we don't support target features makes this useless, but it would be a good thing to have, no?

@bpopeters (Contributor):

Feature embeddings are undoubtedly a good thing to have, and I have used them in my own research. The question is just when, where, and how they get numericalized. I think the best way would be some kind of multilevel field that takes words plus arbitrarily many features as input and, when numericalized, produces the sort of stacked tensor that the embeddings module expects. The NestedField in torchtext almost, but not quite, fits the bill.

It would save the trainer and translator from having to reason about the types of the source and target data.
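
To make the idea concrete, here is one possible shape for it; MultiLevelField and its process signature are hypothetical, not an existing torchtext or OpenNMT-py class:

import torch
from torchtext.data import Field

class MultiLevelField(object):
    """Hypothetical: one sub-Field per annotation level (words, feat_1, ...)."""

    def __init__(self, word_field, feat_fields):
        self.fields = [word_field] + list(feat_fields)

    def process(self, batch, device=None):
        # batch: list of examples; each example is a tuple of equal-length
        # token sequences, one per level: (words, feats_1, ..., feats_n).
        levels = []
        for field, level_batch in zip(self.fields, zip(*batch)):
            padded = field.pad(level_batch)
            levels.append(field.numericalize(padded, device=device))
        # Stack into (seq_len, batch, n_levels), as the embeddings expect.
        return torch.stack(levels, dim=2)

# e.g. MultiLevelField(Field(pad_token='<blank>'), [Field(pad_token='<blank>')])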

@vince62s (Member):

Oh, OK, I misread the "(almost no one uses)" comment.
So in the meantime, are we OK to set a data_type attribute on batch?

@bpopeters (Contributor):

Yes, provided we understand that it's a temporary solution until such a time as the data processing pipeline is capable of numericalizing its own input.
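
In code, the stopgap could be as small as a wrapper around the batch iterator; where exactly the attribute is set is my assumption, not something decided in this thread:

def tag_batches(torchtext_iter, data_type):
    """Yield batches tagged with their data type (temporary workaround)."""
    for batch in torchtext_iter:
        batch.data_type = data_type
        yield batch

make_features could then test batch.data_type == 'text' instead of the ambiguous data.dim() == 4 check.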

@flauted (Contributor, Author):

I can probably take on building that multilevel field. @bpopeters would you please make an issue with that description of what's desired so we have a place to discuss design if I have questions?

@flauted (Contributor, Author) commented on Jan 22, 2019

@bpopeters @vince62s I think this is ready to merge.

@vince62s (Member):

OK, merging.
