Multilevel text field #1216

Merged (18 commits), Jan 25, 2019

Conversation

@flauted (Contributor) commented Jan 24, 2019

This is meant to resolve issue #1200 and originated with comments on #1196.

This introduces a multilevel Field subclass to improve the handling of text feature embeddings.
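
For orientation, here is a minimal sketch of the idea, using simplified, hypothetical names rather than the exact implementation in this PR: a single field object wraps the base word-level field together with any feature-level fields, so downstream code can iterate over (name, field) pairs uniformly.

from torchtext.data import Field, RawField

class MultiFieldSketch(RawField):
    """Sketch only: group a base text field with its feature fields."""

    def __init__(self, base_name, base_field, feats_fields):
        super(MultiFieldSketch, self).__init__()
        # self.fields is a list of (name, Field) pairs; index 0 is the base.
        self.fields = [(base_name, base_field)] + list(feats_fields)

    @property
    def base_field(self):
        return self.fields[0][1]

# Hypothetical construction for a source side with one feature column.
src_word = Field(pad_token="<blank>", include_lengths=True)
src_feat = Field(pad_token="<blank>")
src_multifield = MultiFieldSketch("src", src_word, [("src_feat_0", src_feat)])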

assert len(fields["tgt"]) == 1
tgt_multifield = fields["tgt"][0][1]
for tgt_name, tgt_field in tgt_multifield:
    _build_field_vocab(tgt_field, counters[tgt_name])
Contributor

Could you incorporate the changes in #1199 into this?

Contributor Author

Yes, I've done so.

class TextMultiField(RawField):
    def __init__(self, base_name, base_field, feats_fields):
        super(TextMultiField, self).__init__()
        self.base_field = base_field
Contributor

I'm not sure it needs to have separate attributes for the base field and the feature fields. Inside the class, the distinction doesn't matter and the code in process() could be streamlined if all of the layers were in the same list, something like this:

levels = [f.process(b, device=device) for _, f in self.fields]

And then if we still want a clean way to use just the base or feature fields, we can do it with a property, i.e.

@property
def base_field(self):
    return self.fields[0][1]

Contributor Author

Good idea. I went with it. Unfortunately it doesn't streamline process() since the base field may or may not return a length whereas the feature fields never do.
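
For reference, the shape process() ends up taking is roughly the following. This is a sketch, assuming self.fields is a list of (name, field) pairs with the base field first and torch imported at module level; it is not the verbatim code.

def process(self, batch, device=None):
    # Each example is a list of token sequences, one per level
    # (base words first, then each feature column); regroup by level.
    batch_by_feat = list(zip(*batch))
    base_data = self.base_field.process(batch_by_feat[0], device=device)
    if self.base_field.include_lengths:
        # Only the base field can return (data, lengths).
        base_data, lengths = base_data
    feats = [f.process(batch_by_feat[i], device=device)
             for i, (_, f) in enumerate(self.fields[1:], 1)]
    levels = [base_data] + feats
    # Stack levels along a new last dimension: (seq_len, batch, n_levels).
    data = torch.stack(levels, 2)
    if self.base_field.include_lengths:
        return data, lengths
    return data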

@@ -166,15 +171,6 @@ def make_features(batch, side):
else:
Contributor

Is it possible to remove make_features entirely at this point? It basically just unpacks a tuple and no longer has anything to do with features.

Contributor Author

I'd advise against it. It's six lines of code that would need to be inlined in six places: twice in trainer.py's Trainer._gradient_accumulation, twice in Trainer.validate, and once each in translator.py's Translator._run_encoder and Translator._score_target.

Contributor

Calling make_features(batch, 'src') is equivalent to this one-liner:

src, lengths = batch.src if isinstance(batch.src, tuple) else (batch.src, None)

Calling make_features(batch, 'tgt') is equivalent to this one-liner:

tgt = batch.tgt

This is much less than six lines of code. In either case, I think both one-liners are clearer than calling make_features. Even if the function is kept, it should not be called make_features because that isn't what it does.

Contributor Author

You're right! I got rid of make_features with those one-liners. Thanks!

@bpopeters (Contributor)

This looks like great work. I noted a few small things inline.

for name, field in multifield:
    _build_field_vocab(field, counters[name], **build_fv_args[name])
    logger.info(" * %s vocab size: %d." % (name, len(field.vocab)))

Member

Creating these two functions may trigger a future issue with checkpoints if we change things again.
Could we make them local to build_vocab?
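
The suggestion amounts to something like the following sketch (hypothetical names; the point is only that helpers nested inside build_vocab never become module-level attributes that a pickled object could reference).

def build_vocab(multifield, counters, build_fv_args):
    # Sketch: the helpers exist only for the duration of this call and
    # never appear in the module namespace.
    def _build_field_vocab(field, counter, **kwargs):
        field.vocab = field.vocab_cls(counter, **kwargs)

    for name, field in multifield:
        _build_field_vocab(field, counters[name], **build_fv_args[name])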

Contributor Author

Sure, although I honestly don't see how that would come up. Neither those functions nor build_vocab should be checkpointed, right?

Member

No, you're correct. But now I'm confused about why _feature_tokenize WAS checkpointed in the version between 14 days ago and this morning's commit.

Contributor Author

It still is. text_fields (which used to be part of get_fields) passes a partial of _feature_tokenize into torchtext.data.Field, and the Field keeps it as an attribute. Then ModelSaver in model_saver.py pickles the fields.
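
In other words, the chain is roughly: text_fields builds a functools.partial of _feature_tokenize, hands it to torchtext's Field (which stores the tokenizer as an attribute), and ModelSaver serializes the fields dict into the checkpoint. Below is a simplified sketch with illustrative arguments, not the exact code.

import io
from functools import partial

import torch
from torchtext.data import Field

def _feature_tokenize(string, layer=0, feat_delim=None):
    # Simplified: split on whitespace, then pick one layer per token.
    tokens = string.split()
    if feat_delim is not None:
        tokens = [t.split(feat_delim)[layer] for t in tokens]
    return tokens

# text_fields passes a partial of _feature_tokenize to the Field...
tokenize = partial(_feature_tokenize, layer=0, feat_delim=u"￨")
field = Field(tokenize=tokenize)  # the Field keeps it as an attribute

# ...and saving the fields (as ModelSaver does) serializes that partial,
# and with it a reference to _feature_tokenize.
buf = io.BytesIO()
torch.save({"src": field}, buf)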

@vince62s (Member)

Unless we are 100% sure this does not break anything, I suggest releasing the current master as 0.7.1 so that we have a pretty stable hash. Then we can continue.
What do you think?

@flauted (Contributor Author) commented Jan 24, 2019

I still don't like checkpointing the entirety of Fields, as we were discussing on #1200. If that's going to change, I think it should happen before a release. But I'm also not 100% sure this doesn't break anything: I haven't had time to test this commit with text features, and I don't think Travis or the PR test script tests that either. So whatever you think is best.

@flauted (Contributor Author) commented Jan 25, 2019

I tested it out with features. I believe it's all working now, except that using target features causes translation to fail (maybe expected?). The same problem exists on master, so I opened an issue. I also noticed it isn't backwards compatible with checkpoints made on master; that can likely be fixed.

@bpopeters (Contributor)

The failure of target features is expected.

@flauted (Contributor Author) commented Jan 25, 2019

It's now backwards compatible with checkpoints made with master. However, it will not train with data preprocessed on the current master. If that's okay, then I think this is ready to merge.

@vince62s (Member)

You are confusing me. Are you saying it will not train_from a checkpoint made on current master using data preprocessed with your PR?

@flauted (Contributor Author) commented Jan 25, 2019

Sorry. I'm saying:
If you start on master and run preprocess.py and train.py, you can go to this PR and run translate.py with the training checkpoint.
However, if you start on master and run preprocess.py, you can NOT go to this PR and run train.py.

@vince62s (Member)

OK, what if you train_from with the PR, starting from a master checkpoint? Should that be OK?

@flauted (Contributor Author) commented Jan 25, 2019

Yes.

git checkout master
python preprocess.py -train_src data/src-train-feats.txt -train_tgt data/tgt-train.txt -valid_src data/src-val-feats.txt -valid_tgt data/tgt-val.txt -save_data data/demo-feats
python train.py -data data/demo-feats -save_model demo-model -train_steps 5 -save_checkpoint_steps 5 -batch_size 5

git checkout multilevel-text-field
# preprocess again
rm data/demo-feats.*.pt
python preprocess.py -train_src data/src-train-feats.txt -train_tgt data/tgt-train.txt -valid_src data/src-val-feats.txt -valid_tgt data/tgt-val.txt -save_data data/demo-feats
python train.py -data data/demo-feats -save_model demo-model -train_steps 10 -save_checkpoint_steps 10 -batch_size 5 --train_from demo-model_step_5.pt

# translate final checkpoint
python translate.py -model demo-model_step_10.pt -src data/src-test-feats.txt -verbose
# translate intermediate checkpoint (made on master)
python translate.py -model demo-model_step_5.pt -src data/src-test-feats.txt -verbose

All that works. (If you skip preprocessing again, it fails.)

@vince62s (Member)

OK, let's merge.
