
Add phrase_table translation argument #1370

Merged
merged 4 commits into from Mar 28, 2019
Conversation

@ymoslem (Contributor) commented Mar 27, 2019

If phrase_table is provided (together with replace_unk), the identified source token is looked up and the corresponding target token is used. If it is not provided (or the identified source token is not in the table), the source token is copied instead. Tested with both translate.py and server.py (with conf.json).

The default behaviour of the -replace_unk option is to substitute <unk> (an unknown word) with the source word that has the highest attention weight. When the -phrase_table option is also given, the phrase table file is first consulted for a possible translation; only if no valid replacement is found is the source token copied.

The phrase table is a file with one translation per line, in the format:
source|||target
where source and target are case-sensitive single tokens.
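For reference, the lookup can be pictured roughly as follows (a minimal, self-contained Python sketch for illustration only, not the actual OpenNMT-py code; all function and variable names are made up):

# Minimal sketch of the phrase-table lookup used for <unk> replacement.
# For each <unk> in the prediction, take the source token with the highest
# attention weight; if a phrase table is given and contains that token,
# emit its target side, otherwise copy the source token.

def load_phrase_table(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if "|||" in line:
                src, tgt = line.split("|||", 1)
                table[src] = tgt
    return table

def replace_unks(pred_tokens, src_tokens, attn, phrase_table=None):
    # attn[i] is the attention vector over source positions for predicted token i
    out = []
    for i, tok in enumerate(pred_tokens):
        if tok != "<unk>":
            out.append(tok)
            continue
        j = max(range(len(src_tokens)), key=lambda k: attn[i][k])
        src_tok = src_tokens[j]
        if phrase_table and src_tok in phrase_table:
            out.append(phrase_table[src_tok])
        else:
            out.append(src_tok)
    return out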

Example with translate.py:
python3 OpenNMT-py/translate.py -model available_models/my.model_step_100000.pt -src source.txt -output prep.txt -replace_unk -phrase_table phrase-table.txt

Example with server.py:
python3 OpenNMT-py/server.py --ip "0.0.0.0" --port 5000 --url_root "/translator" --config available_models/conf.json

curl -i -X POST -H "Content-Type: application/json" -d '[{"src": "this is a test for model 100", "id": 100}]' http://127.0.0.1:5000/translator/translate

... where conf.json is:

{
    "models_root": "/home/available_models",
    "models": [
        {   
            "id": 100,
            "model": "my.model_step_100000.pt",
            "timeout": 600,
            "on_timeout": "to_cpu",
            "load": true,
            "opt": {
                "beam_size": 1,
                "replace_unk": true,
                "phrase_table": "/home/available_models/phrase-table.txt"
            }
        }  
    ]   
}
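For reference, the same request can also be sent from Python (a small usage sketch with the requests library, assuming the host, port, and URL root configured above):

import requests

payload = [{"src": "this is a test for model 100", "id": 100}]
r = requests.post("http://127.0.0.1:5000/translator/translate", json=payload)
print(r.json())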

ymoslem added 4 commits March 27, 2019 13:03
@ymoslem (Contributor, Author) left a comment

Fixed flake8 issues.

@ymoslem (Contributor, Author) commented Mar 28, 2019

@vince62s @srush Could you please check? Thanks!

@vince62s (Member)

Thanks for this contribution!
My first comments, but @francoishernandez will comment more when he finds the time.

This is single-token replacement. Nowadays, most people use BPE/subwords, so it would be even better to support multi-token replacement.

Also, I think (but @francoishernandez and @guillaumekln will confirm) this would work for LSTM only. For the Transformer it might be better to use a guided alignment method as in OpenNMT-tf.

@ymoslem (Contributor, Author) commented Mar 28, 2019

@vince62s Thanks, Vincent! I agree that multi-token replacement would be much better. I just wanted to start by imitating the current behaviour of phrase_table in the Lua version, as I thought it is useful in some cases.

So how do you think I should proceed? Do you think it is worth adjusting the code to work on multiple tokens even if it will not work with the Transformer for now? (I have not tested this either.)

Thanks again for your time and insights!

@vince62s (Member)

It's OK, we'll merge this one as is.
Just bringing it to users' attention that it won't work with the Transformer.

@vince62s vince62s merged commit f09cc8c into OpenNMT:master Mar 28, 2019
@ymoslem (Contributor, Author) commented Mar 28, 2019

@vince62s Thanks, Vincent, for merging it! If I am going to work on multi-token replacement:

1. Should it be a new argument like phrase_table_multiple, or the same argument phrase_table?
2. What is the maximum n-gram length it should work on, i.e. how many consecutive source words can be considered a phrase?
3. Should it work on the unknowns only, or on the source before machine translation?

Thanks!

marekstrong pushed a commit to marekstrong/OpenNMT-py that referenced this pull request Jun 10, 2019
rishibommasani added a commit to rishibommasani/OpenNMT-py that referenced this pull request Aug 29, 2019