Multilevel text field #1216

Merged (18 commits), Jan 25, 2019

Conversation

@flauted (Contributor) commented Jan 24, 2019

This is meant to resolve issue #1200 and originated with comments on #1196.

This introduces a multilevel Field subclass to improve the handling of text feature embeddings.
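
For orientation, here is a minimal sketch of the idea, using simplified, hypothetical names rather than the exact implementation in this PR: a single field object wraps the base word-level field together with any feature-level fields, so downstream code can iterate over (name, field) pairs uniformly.

from torchtext.data import Field, RawField

class MultiFieldSketch(RawField):
    """Sketch only: group a base text field with its feature fields."""

    def __init__(self, base_name, base_field, feats_fields):
        super(MultiFieldSketch, self).__init__()
        # self.fields is a list of (name, Field) pairs; index 0 is the base.
        self.fields = [(base_name, base_field)] + list(feats_fields)

    @property
    def base_field(self):
        return self.fields[0][1]

# Hypothetical construction for a source side with one feature column.
src_word = Field(pad_token="<blank>", include_lengths=True)
src_feat = Field(pad_token="<blank>")
src_multifield = MultiFieldSketch("src", src_word, [("src_feat_0", src_feat)])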

assert len(fields["tgt"]) == 1
tgt_multifield = fields["tgt"][0][1]
for tgt_name, tgt_field in tgt_multifield:
    _build_field_vocab(tgt_field, counters[tgt_name])
Contributor

Could you incorporate the changes in #1199 into this?

Contributor Author

Yes, I've done so.

class TextMultiField(RawField):
    def __init__(self, base_name, base_field, feats_fields):
        super(TextMultiField, self).__init__()
        self.base_field = base_field
Contributor

I'm not sure it needs to have separate attributes for the base field and the feature fields. Inside the class, the distinction doesn't matter and the code in process() could be streamlined if all of the layers were in the same list, something like this:

levels = [f.process(b, device=device) for _, f in self.fields]

And then if we still want a clean way to use just the base or feature fields, we can do it with a property, i.e.

@property
def base_field(self):
    return self.fields[0][1]

Contributor Author

Good idea. I went with it. Unfortunately it doesn't streamline process() since the base field may or may not return a length whereas the feature fields never do.
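
For reference, the shape process() ends up taking is roughly the following. This is a sketch, assuming self.fields is a list of (name, field) pairs with the base field first and torch imported at module level; it is not the verbatim code.

def process(self, batch, device=None):
    # Each example is a list of token sequences, one per level
    # (base words first, then each feature column); regroup by level.
    batch_by_feat = list(zip(*batch))
    base_data = self.base_field.process(batch_by_feat[0], device=device)
    if self.base_field.include_lengths:
        # Only the base field can return (data, lengths).
        base_data, lengths = base_data
    feats = [f.process(batch_by_feat[i], device=device)
             for i, (_, f) in enumerate(self.fields[1:], 1)]
    levels = [base_data] + feats
    # Stack levels along a new last dimension: (seq_len, batch, n_levels).
    data = torch.stack(levels, 2)
    if self.base_field.include_lengths:
        return data, lengths
    return data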

@@ -166,15 +171,6 @@ def make_features(batch, side):
else:
Contributor

Is it possible to remove make_features entirely at this point? It basically just unpacks a tuple and no longer has anything to do with features.

Contributor Author

I'd advise against it. It's six lines of code that would need to be inlined in six places: twice in trainer.py's Trainer._gradient_accumulation, twice in Trainer.validate, and once each in translator.py's Translator._run_encoder and Translator._score_target.

Contributor

Calling make_features(batch, 'src') is equivalent to this one-liner:

src, lengths = batch.src if isinstance(batch.src, tuple) else (batch.src, None)

Calling make_features(batch, 'tgt') is equivalent to this one-liner:

tgt = batch.tgt

This is much less than six lines of code. In either case, I think both one-liners are clearer than calling make_features. Even if the function is kept, it should not be called make_features because that isn't what it does.

Contributor Author

You're right! I got rid of make_features with those one-liners. Thanks!

@bpopeters (Contributor)

This looks like great work. I noted a few small things inline.

for name, field in multifield:
    _build_field_vocab(field, counters[name], **build_fv_args[name])
    logger.info(" * %s vocab size: %d." % (name, len(field.vocab)))

Member

Creating these two functions may trigger a future issue with checkpoints if we change things again.
Could we make them local to build_vocab?
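
The suggestion amounts to something like the following sketch (hypothetical names; the point is only that helpers nested inside build_vocab never become module-level attributes that a pickled object could reference).

def build_vocab(multifield, counters, build_fv_args):
    # Sketch: the helpers exist only for the duration of this call and
    # never appear in the module namespace.
    def _build_field_vocab(field, counter, **kwargs):
        field.vocab = field.vocab_cls(counter, **kwargs)

    for name, field in multifield:
        _build_field_vocab(field, counters[name], **build_fv_args[name])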

Contributor Author

Sure, although I honestly don't see how that would come up. Neither those functions nor build_vocab should be checkpointed, right?

Member

No, you're correct. But now I'm confused about why _feature_tokenize WAS checkpointed in the version between 14 days ago and this morning's commit.

Contributor Author

It still is. text_fields (which used to be part of get_fields) passes a partial of _feature_tokenize into torchtext.data.Field, and the Field keeps it as an attribute. Then ModelSaver in model_saver.py pickles the fields.
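
In other words, the chain is roughly: text_fields builds a functools.partial of _feature_tokenize, hands it to torchtext's Field (which stores the tokenizer as an attribute), and ModelSaver serializes the fields dict into the checkpoint. Below is a simplified sketch with illustrative arguments, not the exact code.

import io
from functools import partial

import torch
from torchtext.data import Field

def _feature_tokenize(string, layer=0, feat_delim=None):
    # Simplified: split on whitespace, then pick one layer per token.
    tokens = string.split()
    if feat_delim is not None:
        tokens = [t.split(feat_delim)[layer] for t in tokens]
    return tokens

# text_fields passes a partial of _feature_tokenize to the Field...
tokenize = partial(_feature_tokenize, layer=0, feat_delim=u"￨")
field = Field(tokenize=tokenize)  # the Field keeps it as an attribute

# ...and saving the fields (as ModelSaver does) serializes that partial,
# and with it a reference to _feature_tokenize.
buf = io.BytesIO()
torch.save({"src": field}, buf)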

@vince62s (Member)

Unless we are 100% sure this does not break anything, I suggest releasing the current master as 0.7.1 so that we have a pretty stable hash. Then we can continue.
What do you think?

@flauted (Contributor Author) commented Jan 24, 2019

I still don't like checkpointing the entirety of Fields, as we were discussing on #1200. If that's going to change, I think it should happen before a release. But I'm also not 100% sure this doesn't break anything: I haven't had time to test this commit with text features, and I don't think Travis or the PR test script tests that either. So whatever you think is best.

@flauted (Contributor Author) commented Jan 25, 2019

I tested it out with features. I believe it's all working now, except that using target features causes translation to fail (maybe expected?). The same problem exists on master, so I opened an issue. I also noticed it isn't backwards compatible with checkpoints made on master; that can likely be fixed.

@bpopeters (Contributor)

The failure of target features is expected.

@flauted (Contributor Author) commented Jan 25, 2019

It's now backwards compatible with checkpoints made with master. However, it will not train with data preprocessed on the current master. If that's okay, then I think this is ready to merge.

@vince62s (Member)

You are confusing me. Are you saying it will not train_from a checkpoint made on current master using data preprocessed with your PR?

@flauted (Contributor Author) commented Jan 25, 2019

Sorry. I'm saying:
If you start on master and run preprocess.py and train.py, you can go to this PR and run translate.py with the training checkpoint.
However, if you start on master and run preprocess.py, you can NOT go to this PR and run train.py.

@vince62s (Member)

OK, what if you train_from with the PR, starting from a master checkpoint? Should that be OK?

@flauted (Contributor Author) commented Jan 25, 2019

Yes.

git checkout master
python preprocess.py -train_src data/src-train-feats.txt -train_tgt data/tgt-train.txt -valid_src data/src-val-feats.txt -valid_tgt data/tgt-val.txt -save_data data/demo-feats
python train.py -data data/demo-feats -save_model demo-model -train_steps 5 -save_checkpoint_steps 5 -batch_size 5

git checkout multilevel-text-field
# preprocess again
rm data/demo-feats.*.pt
python preprocess.py -train_src data/src-train-feats.txt -train_tgt data/tgt-train.txt -valid_src data/src-val-feats.txt -valid_tgt data/tgt-val.txt -save_data data/demo-feats
python train.py -data data/demo-feats -save_model demo-model -train_steps 10 -save_checkpoint_steps 10 -batch_size 5 --train_from demo-model_step_5.pt

# translate final checkpoint
python translate.py -model demo-model_step_10.pt -src data/src-test-feats.txt -verbose
# translate intermediate checkpoint (made on master)
python translate.py -model demo-model_step_5.pt -src data/src-test-feats.txt -verbose

All that works. (If you skip preprocessing again, it fails.)

@vince62s (Member)

OK, let's merge.
