
Preprocessing: faster build vocab + multiple weighted datasets #1413

Merged: 15 commits merged into OpenNMT:master from build_vocab_along_shards on May 16, 2019

Conversation

@francoishernandez (Member) commented Apr 29, 2019

The current preprocessing works in the following manner:

  • build train shards (from a single src and tgt pair of files) and valid shards;
  • build the vocabulary from the train shards.

This means the train shards are re-loaded after having been dumped.
We have in mind to simplify this by building the vocabulary along with the train shards (roughly as in the sketch below).
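
For illustration, here is a minimal sketch of that idea, assuming shards are plain lists of tokenized (src, tgt) pairs; the helper name and shard format are hypothetical, not the actual OpenNMT-py code:

```python
from collections import Counter

import torch


def build_shards_and_vocab(shard_iter, save_prefix):
    """Hypothetical sketch (not the actual preprocess.build_save_dataset):
    accumulate vocab counters while the shards are written, instead of
    re-reading the dumped .pt files afterwards."""
    src_counter, tgt_counter = Counter(), Counter()
    for i, shard in enumerate(shard_iter):  # shard: list of (src_tokens, tgt_tokens)
        for src_tokens, tgt_tokens in shard:
            src_counter.update(src_tokens)
            tgt_counter.update(tgt_tokens)
        torch.save(shard, "{}.train.{}.pt".format(save_prefix, i))
    return src_counter, tgt_counter
```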

For now I've just taken the logic from onmt.inputters.inputter.build_vocab and inserted the necessary bits in preprocess.build_save_dataset.
It might not be the cleanest way to do it, but it's a start.

Tested on toy text / speech / image datasets, and it seems to work fine.

I'd be glad to have feedback on a cleaner way to refactor the preprocessing codepath(s), @guillaumekln @flauted @bpopeters.

Next ideas, either in this PR or in a follow-up, would be to:

  • allow several corpora with different weights (upsampling);
  • allow multiple threads while creating shards.

PS: @flauted, while testing things for this PR, I stumbled upon a possibly unwanted behaviour introduced here. In combination with a -src_words_min_frequency or -tgt_words_min_frequency option, it may lead to removing some of the last tokens of the vocab.

@flauted (Contributor) commented Apr 29, 2019

About the PS: good catch. It's older than that though, see here. I haven't really used the text-file vocab feature, but I'm pretty sure it's just a list of (theoretically unique) words, so the counts/freqs are lost. The validation should probably check for non-default min freq/count options when a source vocab path is given; see the illustration below. I don't know whether that's in the scope of this PR or should be done separately, though.
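
To make the interaction concrete, here is a hedged, self-contained illustration with a plain Counter (not the actual OpenNMT-py vocab-loading code): if the reloaded words only carry placeholder counts tied to their position, a non-default min frequency trims the tail of the vocab.

```python
from collections import Counter

# Hedged illustration, not the actual OpenNMT-py code: a vocab text file is
# just an ordered list of words, so the real frequencies are lost on reload.
words = ["the", "of", "cat", "sat", "rare_token"]

# If placeholder counts are derived from position (so that ordering survives),
# the last entries end up with the smallest counts...
counter = Counter({w: len(words) - i for i, w in enumerate(words)})

# ...and a non-default min frequency then silently drops them.
min_frequency = 3
kept = [w for w, c in counter.items() if c >= min_frequency]
print(kept)  # ['the', 'of', 'cat'] -- 'sat' and 'rare_token' are trimmed
```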

@francoishernandez (Member, Author) commented May 2, 2019

I'm extending the scope of this PR: adding the possibility to pass several train corpora at preprocessing time and to give them specific weights at training time.

Preprocessing

I introduce -train_ids, which is a list of IDs that will be given to the preprocessed shards.
E.g. if we have two corpora, parallel.en/parallel.de and from_backtranslation.en/from_backtranslation.de, we can pass

```
...
-train_src parallel.en from_backtranslation.en \
-train_tgt parallel.de from_backtranslation.de \
-train_ids A B \
-save_data my_data \
...
```

and it will dump my_data.train_A.X.pt shards based on parallel.en/parallel.de and my_data.train_B.X.pt shards based on from_backtranslation.en/from_backtranslation.de.

Training

I introduce -data_ids, based on the same principle as above, as well as -data_weights, which is the list of weights each corpus should have.
E.g.

```
...
-data my_data \
-data_ids A B \
-data_weights 1 7 \
...
```

will mean that we'll look for my_data.train_A.*.pt and my_data.train_B.*.pt, and that when building batches, we'll take 1 example from corpus A, then 7 examples from corpus B, and so on.

For this purpose, I created the MultipleDatasetIterator class in onmt.inputters.inputter, which takes a list of DatasetLazyIter and the respective list of weights, and yields batches according to the rule described above.
The idea is to spawn each DatasetLazyIter with batch_size 1 and a new yield_raw_example parameter, so that it yields raw examples from each corpus one at a time (see the sketch below).
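
As an illustration of the weighting rule only, here is a minimal sketch of a weighted round-robin over plain Python iterables; the function name is hypothetical, and it glosses over the batching, device handling and epoch boundaries that the real MultipleDatasetIterator deals with.

```python
from itertools import cycle


def weighted_interleave(iterators, weights):
    """Minimal sketch of the weighted round-robin described above; this is a
    hypothetical stand-in, not the actual MultipleDatasetIterator."""
    cycles = [cycle(it) for it in iterators]  # restart a corpus when exhausted
    while True:
        for it, weight in zip(cycles, weights):
            for _ in range(weight):
                yield next(it)


# Usage with the A/B example above: 1 example from A, then 7 from B, repeated.
corpus_a = ["a1", "a2"]
corpus_b = ["b{}".format(i) for i in range(20)]
stream = weighted_interleave([corpus_a, corpus_b], [1, 7])
print([next(stream) for _ in range(16)])  # a1, b0..b6, a2, b7..b13
```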

@vince62s (Member) commented May 8, 2019

Beyond the few comments, it does not work. I think the vocab built this way is wrong; I tried a quick training and it does not work properly.

@vince62s (Member) commented

The main issue was my fault; it seems to work fine now.
Will merge.

@vince62s merged commit fae4d62 into OpenNMT:master on May 16, 2019
@francoishernandez deleted the build_vocab_along_shards branch on June 13, 2019
rishibommasani added a commit to rishibommasani/OpenNMT-py that referenced this pull request Aug 29, 2019
* advanced noam with decay and accum scheduler (OpenNMT#1367)

* advanced noam with decay and accum scheduler

* Add phrase_table translation argument (OpenNMT#1370)

* Add phrase_table translation argument

If phrase_table is provided (with replace_unk), it will look up the identified source token and give the corresponding target token. If it is not provided (or the identified source token does not exist in the table), then it will copy the source token.

* Have EnsembleDecoder set attentional property. (OpenNMT#1381)

* More efficient embeddings_to_torch.py (OpenNMT#1372)

* Update embeddings_to_torch.py to be more memory efficient by only loading vectors which are present in the vocab into memory.

* remove dead code and flake8 violations introduced with 57cefb7

* update docs of using Glove embeddings. Fix spelling error

* write attention debug to log file (OpenNMT#1384)

* Better handle Cuda OOM with overflow batches (OpenNMT#1385)

* Added earlystopping mechanism (OpenNMT#1389)

* Added earlystopping mechanism
* Fixed earlystopping multi-gpu stoppage

* check vocab files exist at start of preprocessing (OpenNMT#1396)

* Avoid padding indices in MeanEncoder (OpenNMT#1398)

* We avoid padding while mean pooling
* placed batch dimension first for bmm
* replaced accidentally deleted line

* fix Runtime error in Library tutorial (OpenNMT#1399)

* Check -gpu_ranks option to ensure saving a model (OpenNMT#1407)

* Check -gpu_ranks option to ensure saving a model
* split condition to check -gpu_ranks inconsistency

* add src or tgt min frequency to counter value (OpenNMT#1414)

* fix typo (OpenNMT#1416)

* fix goldscore OpenNMT#1383 (OpenNMT#1423)

* fix OpenNMT#1383

* fix gold score only

* Upgrade Travis to Torch 1.1 (OpenNMT#1426)

* Introduce dropout scheduler (OpenNMT#1421)

* add update_dropout methods approx. everywhere, dropout scheduler
* more meaningful log
* forgot some layers in audio_encoder

* Preprocessing: faster build vocab + multiple weighted datasets (OpenNMT#1413)

* handle multiple training corpora and enable weighting
* move fields vocab building logic in function
* fix device handling MultipleDatasetIterator
* fix multi/yield_raw_batch parameter DatasetLazyIter
* update FAQ.md
* add -pool_factor option
* reduce pool_factor for travis runs

* bump version (OpenNMT#1434)

* make MultipleDatasetIterator only if necessary (OpenNMT#1436)

* Update README.md (OpenNMT#1437)

* small fix multi when common root in data_ids (OpenNMT#1444)

* do not overwrite pt vocab when preprocessing again (OpenNMT#1447)

* trim vocab(s) before saving checkpoint (OpenNMT#1453)

* Using Producer-Consumer for batches (OpenNMT#1450)

* Working queues on multi-GPU on text and audio
* Working quite well, even with dynamic_dict
* Remove explicit garbage collect making some queue hang and other fixes
* fix process not ending
* properly set random seed and fill queues sequentially
* make queues work with distributed training

* [fix] Make queue.put() blocking again (OpenNMT#1455)

Fix OpenNMT#1454 .

* Clarify mixed precision training support (OpenNMT#1458)

Change the wording to avoid confusion. Mixed precision ensures both higher arithmetic throughput and numerical stability, not exactly synonymous to pure half-precision/FP16 training. Also add mentioning of tensor cores since older generation GPUs without tensor cores don't support true mixed precision training.

* Update requirements.opt.txt

* Update requirements.opt.txt

* Change map_location to be 'cpu' (OpenNMT#1461)

* Change map_location to be 'cpu'

If you are on a CPU-only machine, it will give an error otherwise. Model averaging should not require a GPU; moreover, it may be faster to use CPU rather than move all models to the GPU to average them.

* New apex amp API (OpenNMT#1465)

* use new apex amp API
* make apex opt_level as option

* bump 0.9.1 (OpenNMT#1466)

* Do not raise an error for missing validation data (OpenNMT#1467)

* fix incorrect script path in CONTRIBUTING.md (OpenNMT#1470) (OpenNMT#1472)

* Fix a potential IndexError when translating with replace_unk (OpenNMT#1469)

* Fix IndexError which happens with replace_unk, when the argmax of the attention is on the padding instead of a real source token

* add health endpoint to server.py (OpenNMT#1471)

* fix typo

* Minor change in MultiHeadedAttention  documentation (OpenNMT#1479)

* Minor change in documentation

* Optimize AAN transformer and small fixes (OpenNMT#1482)

* Optimize AAN transformer and small fixes
* Make use of FFN layer in AAN an option

* Implementing coverage loss of abisee (2017) (OpenNMT#1464)

* Implementing coverage loss of abisee (2017)
* fix lambda_coverage value

* Video captioning (OpenNMT#1409)

* Add feature extraction tool.
* Update preprocess.
* Add training and translation.
* Adapt transformer for video.
* Add tutorial to docs.
* Add folded val files for easier 'early stop.'
* Add and document transformer.

* ignore batch if over allowed tokens batch, add warning (OpenNMT#1490)

* allow implicit batch_size in translation_server (OpenNMT#1492)

* ensure building sequence mask on same device as lengths (OpenNMT#1494)

* add preprocess_opt in rest server (ZH) (OpenNMT#1493)

* fix build_dataset_iter in train_single (OpenNMT#1499)

* Use functions as preprocess / postprocess in REST server (OpenNMT#1505)

* add preprocess_opt in rest server (ZH)

* add preprocess and postprocess in rest server

* simplify

* fix function name

* fix function name v2

* [fix] remove implicit check in preprocess (OpenNMT#1507)

* [fix] remove implicit check in preprocess

There were some implicit checks on `src_vocab` and `tgt_vocab` in preprocessing.
This was creating some unwanted behavior when loading an existing vocab as a text file.

* fix typo

* add attention_dropout separate from dropout (OpenNMT#1512)

* add attention_dropout separate from dropout

* fix compatibility with models without attention_dropout (OpenNMT#1514)

* pytorch 1.2 compatibility - mask & bool tensor (OpenNMT#1527)

* Fix typo: traget -> target (OpenNMT#1537)

* Tokens batch for translation (OpenNMT#1545)

* wip translate batch tokens
* move logic in translator