Multilevel text field #1216
Conversation
…no longer necessary
onmt/inputters/inputter.py
Outdated
assert len(fields["tgt"]) == 1
tgt_multifield = fields["tgt"][0][1]
for tgt_name, tgt_field in tgt_multifield:
    _build_field_vocab(tgt_field, counters[tgt_name])
Could you incorporate the changes in #1199 into this?
Yes, I've done so.
onmt/inputters/text_dataset.py
Outdated
class TextMultiField(RawField):
    def __init__(self, base_name, base_field, feats_fields):
        super(TextMultiField, self).__init__()
        self.base_field = base_field
I'm not sure it needs to have separate attributes for the base field and the feature fields. Inside the class, the distinction doesn't matter and the code in process() could be streamlined if all of the layers were in the same list, something like this:
levels = [f.process(b, device=device) for _, f in self.fields]
And then if we still want a clean way to use just the base or feature fields, we can do it with a property, i.e.
@property
def base_field(self):
    return self.fields[0]
Good idea. I went with it. Unfortunately it doesn't streamline process() since the base field may or may not return a length whereas the feature fields never do.
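For context, here is a minimal sketch of what the class could look like after that change. It assumes the torchtext legacy RawField/Field API; the names follow the diff above, but the method bodies are illustrative rather than the PR's exact code.

import torch
from torchtext.data import RawField


class TextMultiField(RawField):
    def __init__(self, base_name, base_field, feats_fields):
        super(TextMultiField, self).__init__()
        # Level 0 is the word field; any token-feature fields follow it.
        self.fields = [(base_name, base_field)] + list(feats_fields)

    @property
    def base_field(self):
        # Clean accessor for the word-level field (unwraps the (name, field) pair).
        return self.fields[0][1]

    def process(self, batch, device=None):
        # batch is a list of examples, each a list of token sequences, one per level.
        # Regroup so that each level is processed by its own sub-field.
        batch_by_level = list(zip(*batch))
        base_data = self.base_field.process(batch_by_level[0], device=device)
        # The base field may return (data, lengths) when include_lengths is set;
        # the feature fields never do, so this branch cannot be folded away.
        if self.base_field.include_lengths:
            base_data, lengths = base_data
        feats = [f.process(batch_by_level[i], device=device)
                 for i, (_, f) in enumerate(self.fields[1:], 1)]
        # Stack the word tensor and the feature tensors into (len x batch x nlevels).
        data = torch.stack([base_data] + feats, 2)
        return (data, lengths) if self.base_field.include_lengths else data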
onmt/inputters/inputter.py
Outdated
@@ -166,15 +171,6 @@ def make_features(batch, side):
    else:
Is it possible to remove make_features entirely at this point? It basically just unpacks a tuple and no longer has anything to do with features.
I'd advise against it. It's 6 lines of code that would need to be inlined in 6 places: in trainer.py, Trainer._gradient_accumulation twice and Trainer.validate twice; and in translator.py, Translator._run_encoder once and Translator._score_target once.
Calling make_features(batch, 'src') is equivalent to this one-liner:
src, lengths = batch.src if isinstance(batch.src, tuple) else (batch.src, None)
Calling make_features(batch, 'tgt') is equivalent to this one-liner:
tgt = batch.tgt
This is much less than six lines of code. In either case, I think both one-liners are clearer than calling make_features. Even if the function is kept, it should not be called make_features, because that isn't what it does.
You're right! I got rid of make_features with those one-liners. Thanks!
This looks like great work. I noted a few small things inline.
…NMT#1199, rename TextMultiField attrs.
for name, field in multifield:
    _build_field_vocab(field, counters[name], **build_fv_args[name])
    logger.info(" * %s vocab size: %d." % (name, len(field.vocab)))
Creating these two functions may trigger a future issue with checkpoints if we change them again.
Could we make them local to build_vocab?
Sure, although I honestly don't see how that would come up. Neither those functions nor build_vocab should be checkpointed, right?
No, you're correct. But now I am confused about why _feature_tokenize WAS checkpointed in the version between 14 days ago and this morning's commit.
It still is. text_fields (which used to be part of get_fields) passes a partial of _feature_tokenize into torchtext.data.Field. Field keeps it as an attribute, and model_saver.py's ModelSaver pickles the fields.
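To make that mechanism concrete, here is a hedged sketch: the _feature_tokenize body and its arguments are simplified, and torchtext's legacy Field plus plain pickle stand in for the actual checkpoint path.

import pickle
from functools import partial

from torchtext.data import Field


def _feature_tokenize(string, layer=0, feat_delim=None, truncate=None):
    # Simplified sketch: split a sentence and keep only one token-feature layer.
    tokens = string.split()
    if truncate is not None:
        tokens = tokens[:truncate]
    if feat_delim is not None:
        tokens = [t.split(feat_delim)[layer] for t in tokens]
    return tokens


# text_fields passes a partial of the tokenizer into the Field constructor...
word_field = Field(
    tokenize=partial(_feature_tokenize, layer=0, feat_delim=u"￨"),
    include_lengths=True)

# ...and Field keeps it as an attribute, so serializing the fields (which is
# what ModelSaver does when it writes a checkpoint) also serializes the
# partial. That only works while _feature_tokenize stays importable from a
# stable module path, which is why moving it could affect old checkpoints.
checkpoint_blob = pickle.dumps(word_field)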
Unless we are 100% sure it does not break anything, I suggest releasing the current master as 0.7.1 so that we have a pretty stable hash. Then we can continue.
I still don't like checkpointing the entirety of the fields.
I tested it out with features. I believe it's all working now, except that using target features causes translation to fail (maybe expected?). Same problem on master, so I opened an issue. Also, I noticed it isn't backwards compatible with checkpoints made on master. That can likely be fixed.
The failure of target features is expected.
It's now backwards compatible with checkpoints made with master. However, it will not train with data preprocessed on the current master. If that's okay, then I think this is ready to merge.
You are confusing me. Are you saying it will not train_from a checkpoint made on the current master with new data preprocessed with your PR?
Sorry. I'm saying: it will train_from a checkpoint made on master, but the data has to be re-preprocessed with this PR; data preprocessed on the current master will not work.
OK, what if you train_from with the PR, from a master checkpoint? Should that be ok?
Yes. All that works. (If you skip preprocessing again, it fails.)
OK, let's merge.
This is meant to resolve issue #1200 and originated with comments on #1196.
It introduces a multilevel Field subclass to improve the handling of text feature embeddings.
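As a concrete illustration of what "multilevel" means for the data (a hypothetical example; the "￨" delimiter is the one OpenNMT-py uses for token features):

# One source sentence carrying a word level and one token-feature level.
src_line = "the￨DET cat￨NOUN sat￨VERB"

# A multilevel field keeps one sub-field per level, so the line splits into
# parallel token streams that each get their own vocabulary and embedding.
levels = list(zip(*(tok.split(u"￨") for tok in src_line.split())))
assert levels[0] == ("the", "cat", "sat")    # level 0: words (the base field)
assert levels[1] == ("DET", "NOUN", "VERB")  # level 1: token features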