This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Fixed tensor2tensor language modeling decode #1282

Merged
merged 11 commits into tensorflow:master on Jan 11, 2019

Conversation

mikeymezher
Contributor

@mikeymezher mikeymezher commented Dec 6, 2018

Allows for variable batch sizes; removes padding and EOS token insertion.
(See: #1227)

@googlebot googlebot added the cla: yes PR author has signed CLA label Dec 6, 2018
@lukaszkaiser lukaszkaiser merged commit 31701a4 into tensorflow:master Jan 11, 2019
@lukaszkaiser
Contributor

Thanks for this, and sorry for waiting so long! I had to remove some parts, and in a follow-up I'll unfortunately have to remove more to make it work with the newest state. But thanks to this, LM decoding works now (at least at batch size 1 after my update). Great thanks!

tensorflow-copybara pushed a commit that referenced this pull request Jan 11, 2019
PiperOrigin-RevId: 228809213
@mikeymezher
Contributor Author

@lukaszkaiser
There still seems to be an issue. The padding token (char 0 in most vocabs) is still being added in decode_batch_input_fn for batch sizes > 1.

As it stands, this is only guaranteed to work with batch size 1 for LM problems (since batch sizes > 1 will cause padding to be appended to shorter inputs -- which would cause inaccurate decoding). This means decoding from file for LM will be unnecessarily slow.

My fix was meant to group all sequences of the same length together so that padding never needs to be used. In my local testing, the _decode_batch_input_fn_no_padding function I wrote works fine, and the LM results are as expected.
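
For illustration, here is a minimal sketch of that grouping idea, assuming length-sorted inputs and a vocabulary object with an encode() method; the names and the yielded format are simplified stand-ins, not the exact code from this PR:

```python
def _decode_batch_input_fn_no_padding(sorted_inputs, vocabulary, max_batch_size):
  """Yields batches of equal-length sequences, so no padding or EOS is needed."""
  batch, batch_len = [], None
  for line in sorted_inputs:
    ids = vocabulary.encode(line)  # note: no EOS appended for LM decoding
    if batch and (len(ids) != batch_len or len(batch) == max_batch_size):
      yield {"inputs": batch}  # every sequence in this batch has batch_len ids
      batch = []
    batch_len = len(ids)
    batch.append(ids)
  if batch:
    yield {"inputs": batch}
```

Because the inputs arrive sorted by length, equal-length sequences are adjacent, so batches stay rectangular without inserting any padding token.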

@mikeymezher
Contributor Author

mikeymezher commented Feb 13, 2019

I should also mention that my function requires _get_sorted_inputs(...) to be called first (which should probably be called regardless of whether my function is used, since sequences causing OOM can still be an issue in LM problems). If you're testing on t2t v1.12.0, it is no longer called for language modeling problems (_get_language_modeling_inputs(...) is called instead). To keep support for SubwordTokenizer, _get_language_modeling_inputs(...) can probably be called after _get_sorted_inputs(...) if not has_input.
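
A rough, self-contained sketch of that ordering (hypothetical names; the real helpers are _get_sorted_inputs and _get_language_modeling_inputs in tensor2tensor/utils/decoding.py, and preprocess_lm_line stands in for the per-line LM preprocessing):

```python
# Sketch only: sort first, then apply LM-specific preprocessing to the
# already-sorted lines, instead of replacing the sort entirely.
def prepare_lm_decode_inputs(lines, preprocess_lm_line, has_input=False):
  # Sort by length so equal-length sequences are adjacent and OOM-prone
  # long sequences are grouped together (what _get_sorted_inputs provides).
  keyed = sorted(enumerate(lines), key=lambda kv: len(kv[1]))
  sorted_keys = [k for k, _ in keyed]
  sorted_inputs = [line for _, line in keyed]
  if not has_input:
    # LM-specific preprocessing applied after sorting keeps
    # SubwordTokenizer support without losing the length ordering.
    sorted_inputs = [preprocess_lm_line(line) for line in sorted_inputs]
  return sorted_inputs, sorted_keys


# Example usage with a trivial stand-in for the LM preprocessing:
inputs, keys = prepare_lm_decode_inputs(
    ["a much longer input line", "a short line"], lambda s: s.strip())
```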

@mikeymezher
Contributor Author

@lukaszkaiser A better solution than what I previously implemented would be to left pad the timing signal/positional embedding as well as left pad the inputs. That way the shape is maintained, so batching can be done without regrouping by sequence length, and the timing signal isn't shifted.
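
A minimal sketch of that left-padding idea (NumPy, illustrative only; the real change would live in the decoding and embedding code): pad on the left so all sequences share a shape, and offset the position indices so the timing signal still starts at 0 on the first real token.

```python
import numpy as np

def left_pad_batch(sequences, pad_id=0):
  max_len = max(len(s) for s in sequences)
  inputs = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
  positions = np.zeros((len(sequences), max_len), dtype=np.int64)
  for i, seq in enumerate(sequences):
    offset = max_len - len(seq)
    inputs[i, offset:] = seq
    # Positions 0..len(seq)-1 are assigned to the real tokens, so a timing
    # signal computed from `positions` is not shifted by the left padding.
    positions[i, offset:] = np.arange(len(seq))
  return inputs, positions

# Example: batching sequences of different lengths without regrouping.
inputs, positions = left_pad_batch([[5, 6, 7], [8, 9]])
# inputs    -> [[5, 6, 7], [0, 8, 9]]
# positions -> [[0, 1, 2], [0, 0, 1]]
```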

kpe pushed a commit to kpe/tensor2tensor that referenced this pull request Mar 2, 2019
* Changed reuse val from true to tf.AUTO_REUSE in top to allow for proper weight initialization

* Fixed decoding from file for language modeling problems (when has_input=False)

* nan

* Fixed language modeling decoding from file (allowing variable batch size without padding and EOS insertion)

* Delete misc.xml

* Delete modules.xml

* Delete tensor2tensor.iml

* Delete vcs.xml

* Delete workspace.xml
kpe pushed a commit to kpe/tensor2tensor that referenced this pull request Mar 2, 2019
PiperOrigin-RevId: 228809213