Move batching and field logic from inputter to dsets #1196
Conversation
I'm in favor of moving things out of `inputter`. Could you explain the distinction between top-level and other fields? It seems to necessitate the […]
It's there because of this bit in the current master.
For whatever reason, that `src_lengths` field is scoped at the same level as `src` and `tgt` in the fields dict.
I think it is used for sequence packing, the same as the lengths used for the text source field. It would be nice if the lengths could be created the same way for audio as for text, but I don't think that can be done unless the audio field has `include_lengths`.
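For context on why those lengths have to ride along with the batch, here is a minimal sketch of PyTorch sequence packing; the tensor shapes are illustrative, not OpenNMT-py's actual code:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# A zero-padded batch of three sequences, shaped (seq_len, batch, features),
# with each sequence's true length carried separately.
src = torch.zeros(5, 3, 8)
src_lengths = torch.tensor([5, 3, 2])  # sorted descending, as packing requires by default

# Packing lets the encoder RNN skip the padding entirely, which is why a
# lengths tensor has to accompany the batch for both text and audio.
packed = pack_padded_sequence(src, src_lengths)
```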
A while ago I tried to get rid of `src_lengths` (the number of frames) for audio by inferring it from the padding.
I blew away the […]. @bpopeters I guess you can be the judge of whether it's actually cleaner to use a Fields subclass rather than use that […].
onmt/inputters/inputter.py (Outdated)

```diff
- if isinstance(batch.__dict__[side], tuple) or side == 'tgt':
+ if not data.dim() == 4 and (
+         isinstance(batch.__dict__[side], tuple) or side == 'tgt'):
```
Not sure I follow this `data.dim() == 4` test.
Oh yeah, that's not a good change. Sorry.
It used to be `if data_type == 'text':`, but #1184 axed the `data_type` argument and instead checks whether `batch.src` is a tuple, since previously only text was a tuple. Now audio is a tuple too.
I'd suggest going back to passing `data_type` into `make_features` now that there's an actual ambiguity. Is that okay?
I don't think you have access to `data_type` in all the places where you call `make_features`.
Actually I'm looking at the usages and that's not okay. I wonder if there's a way to attach the data type to the batch.
In the long run, by the time it's batched it shouldn't be necessary to know the data type. It's just a batch of data that the model should be able to handle. It only comes up right now because our way of handling feature embeddings (a feature almost no one uses) is super hacky.
Well, what would be a better way? When I used the Lua version, feature embeddings were very convenient and reasonably efficient. In OpenNMT-py today, just the fact that we don't support target features makes this useless, but it would be a good thing to have, no?
Feature embeddings are undoubtedly a good thing to have, and I have used them in my own research. The question is just when, where, and how they get numericalized. I think the best way would be some kind of multilevel field that takes an input consisting of words plus arbitrarily many features and, when numericalized, produces the sort of stacked tensor that the embeddings module expects. The `NestedField` in torchtext almost, but not quite, fits the bill.
It would save the trainer and translator from having to reason about the types of the source and target data.
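A minimal sketch of what such a multilevel field could look like, assuming parallel token/feature lists per example; the class name, its API, and the vocab handling are all hypothetical, not OpenNMT-py or torchtext code:

```python
import torch

class MultiLevelField:
    """Hypothetical sketch: one vocab per level (words first, then features)."""

    def __init__(self, vocabs):
        self.vocabs = vocabs  # list of token -> index mappings

    def numericalize(self, batch):
        # batch: list of examples; each example is a list of levels;
        # each level is a list of tokens, already padded to equal length.
        levels = []
        for i, vocab in enumerate(self.vocabs):
            ids = [[vocab[tok] for tok in ex[i]] for ex in batch]
            levels.append(torch.tensor(ids).t())  # (seq_len, batch)
        # Stack levels last, producing the (seq_len, batch, n_levels)
        # tensor the feature-embeddings module expects.
        return torch.stack(levels, dim=2)

vocabs = [{'the': 0, 'cat': 1}, {'DET': 0, 'NOUN': 1}]
field = MultiLevelField(vocabs)
batch = [[['the', 'cat'], ['DET', 'NOUN']]]  # one example: words plus one feature
print(field.numericalize(batch).shape)       # torch.Size([2, 1, 2])
```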
Oh, ok, I misread the "(almost no one uses)" comment.
So in the meantime, are we ok to set a `data_type` attribute on the batch?
Yes, provided we understand that it's a temporary solution until such a time as the data processing pipeline is capable of numericalizing its own input.
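A minimal sketch of that stopgap, assuming a small iterator wrapper; the class and attribute placement here are illustrative, not the actual OpenNMT-py classes:

```python
class TypedIterator:
    """Hypothetical wrapper that tags every batch with its data type."""

    def __init__(self, iterator, data_type):
        self.iterator = iterator
        self.data_type = data_type

    def __iter__(self):
        for batch in self.iterator:
            # Downstream code (e.g. make_features) can now branch on
            # batch.data_type instead of taking a data_type argument.
            batch.data_type = self.data_type
            yield batch
```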
I can probably take on building that multilevel field. @bpopeters would you please make an issue with that description of what's desired so we have a place to discuss design if I have questions?
@bpopeters @vince62s I think this is ready to merge.
ok merging.
- Move `inputters.make_img` -> `inputters.image_dataset.batch_img` (no change other than renaming)
- Move `inputters.make_audio` -> `inputters.audio_dataset.batch_audio` (no change other than renaming)
- Move `inputters._feature_tokenize` -> `inputters.text_dataset._feature_tokenize` (no change)
- Change `get_fields` to dispatch to the datatype's (new) fields function. The function takes arbitrary arguments, so it's extensible without a case-like structure over the datatypes in `get_fields`. It returns fields that are to be "top-level" (`src_lengths` for audio) as well as ones that are scoped under the dataset's side (src/tgt). A sketch of this dispatch follows the list.

Originally part of #1194