fix onmt as library example #1292
Conversation
fixing some pep8 stuff
Thanks for the PR,
I just suggested a bunch of pep8 fixes to make it a bit more compliant.
As I have no direct write access to this PR, I opened a PR on your branch; please merge it into your repo before we merge this one into master.
a bit more pep8 compliant version
Thanks for the formatting!
Thanks for your contribution. Do you think we could have a more meaningful toy set / example that does not output only unk?
…ix/library-example-docs
Thank you for your comments. I also think it would be better to output a real translation, but we should be able to get one from this data and this model, as was the case before the code changes.
docs/source/Library.md
Outdated
```python
from itertools import chain
train_data_file = "data/data.train.0.pt"
valid_data_file = "data/data.valid.0.pt"
dataset_fields = dict(chain.from_iterable(vocab_fields.values()))
```
It is not very intuitive that iterators expect "fields" to be a dict[str, torchtext.data.Field] whereas Translator or TranslationBuilder expect "fields" to be a dict[str, list[tuple[str, torchtext.data.Field]]].
Maybe choose distinct names?
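For concreteness, the two shapes being contrasted look like this (a sketch; src_field and tgt_field are stand-ins for torchtext.data.Field objects, not repo code):

```python
# what the iterators accept in practice: dict[str, Field]
iter_fields = {"src": src_field, "tgt": tgt_field}

# what Translator / TranslationBuilder expect:
# dict[str, list[tuple[str, Field]]]
nested_fields = {"src": [("src", src_field)], "tgt": [("tgt", tgt_field)]}
```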
Or they should all expect the same structure. See this comment:
https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/train_single.py#L139
The docs for the format that DatasetLazyIter expects are evidently wrong.
Should they? As of now, giving DatasetLazyIter a dict[str, torchtext.data.Field] actually makes the code work, while giving it the original dict[str, list[tuple[str, torchtext.data.Field]]] does not allow training to start, because the structure is not correct. Could you expand a bit more on the logic behind this?
You're correct that DatasetLazyIter actually needs dict[str, torchtext.data.Field]. The docs claim DatasetLazyIter expects dict[str, list[tuple[str, torchtext.data.Field]]] (I wrote those docs, and I got it wrong because I incorrectly assumed it would take the same structure as the other fields arguments, sorry!).
Now, the comment in train_single.py that I linked above says that DatasetLazyIter expecting a different structure is (or, should have been) a temporary thing. It's a trivial fix: move the dict(itertools.chain.from_iterable(fields.values())) business inside DatasetLazyIter.
But it gets weird, because when torchtext says fields, e.g. in Dataset, it means list[tuple[str, Field]]. Onmt uses the extra level of nesting because onmt text features (POS tags, NER, etc.) used to be handled as separate fields. Now that's unnecessary because of TextMultiField.
All of which is to say: yes, they should be named differently and the docs should be updated, OR they should expect the same thing (either with the extra level of nesting or, more elegantly, without it, so that it matches torchtext).
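A minimal sketch of that trivial fix, assuming a much-simplified constructor (the real DatasetLazyIter takes more arguments; this is illustrative, not the actual onmt source):

```python
from itertools import chain

class DatasetLazyIter:
    # simplified stand-in for onmt.inputters.inputter.DatasetLazyIter
    def __init__(self, dataset_paths, fields, batch_size, device, is_train):
        # accept the nested dict[str, list[tuple[str, Field]]] used everywhere
        # else in onmt and flatten it here, so callers no longer need the
        # dict(chain.from_iterable(fields.values())) dance at the call site
        self.fields = dict(chain.from_iterable(fields.values()))
        self.dataset_paths = dataset_paths
        self.batch_size = batch_size
        self.device = device
        self.is_train = is_train
```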
Well, torchtext is inconsistent about what it means too:
https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L23 (fields is a dict[str, Field])
https://github.com/pytorch/text/blob/master/torchtext/data/example.py#L19 (fields should be dict[str, list[tuple[str, Field]]] or dict[str, tuple[str, Field]])
Interesting. I have left it that way in the example for the moment, just making it a bit more explicit; let me know when things are settled on how onmt handles this.
@elisemicho I think I got it. The root of the problem is preprocessing with 1000 tokens in the input and output vocabs. Bump that to 10000 (e.g. via preprocess.py's -src_vocab_size and -tgt_vocab_size options).
That alone will fix the <unk>-iness, but the predictions will mostly be empty or periods. If you want a more realistic example like @vince62s is saying, try this:
Here's my code:

```python
from itertools import chain

import torch
import torch.nn as nn

import onmt
import onmt.inputters
import onmt.modules
import onmt.utils
import onmt.translate
from onmt.utils.logging import init_logger

init_logger(None)


def main():
    # vocab_fields maps "src"/"tgt" to lists of (name, field) pairs;
    # base_field is the word-level field of the multifield.
    vocab_fields = torch.load("data/data.vocab.pt")

    src_text_field = vocab_fields["src"][0][1].base_field
    src_vocab = src_text_field.vocab
    src_padding = src_vocab.stoi[src_text_field.pad_token]

    tgt_text_field = vocab_fields["tgt"][0][1].base_field
    tgt_vocab = tgt_text_field.vocab
    tgt_padding = tgt_vocab.stoi[tgt_text_field.pad_token]

    emb_size = 100
    rnn_size = 500

    # Specify the core model.
    encoder_embeddings = onmt.modules.Embeddings(emb_size, len(src_vocab),
                                                 word_padding_idx=src_padding)
    encoder = onmt.encoders.RNNEncoder(hidden_size=rnn_size, num_layers=1,
                                       rnn_type="LSTM", bidirectional=True,
                                       embeddings=encoder_embeddings)

    decoder_embeddings = onmt.modules.Embeddings(emb_size, len(tgt_vocab),
                                                 word_padding_idx=tgt_padding)
    decoder = onmt.decoders.decoder.InputFeedRNNDecoder(
        hidden_size=rnn_size, num_layers=1, bidirectional_encoder=True,
        rnn_type="LSTM", embeddings=decoder_embeddings)

    model = onmt.models.model.NMTModel(encoder, decoder)

    # you could use torch.device(str) instead of str if you want
    dev = "cuda" if torch.cuda.is_available() else "cpu"

    # Specify the tgt word generator and loss computation module
    model.generator = nn.Sequential(
        nn.Linear(rnn_size, len(tgt_vocab)),
        nn.LogSoftmax(dim=-1))
    model.to(dev)

    loss = onmt.utils.loss.NMTLossCompute(
        criterion=nn.NLLLoss(ignore_index=tgt_padding, reduction='sum'),
        generator=model.generator)

    lr = 1
    torch_optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    optim = onmt.utils.optimizers.Optimizer(
        torch_optimizer, learning_rate=lr, max_grad_norm=2)

    # Load some data. DatasetLazyIter wants a flat dict[str, Field],
    # so flatten the nested vocab_fields structure (see discussion above).
    train_data_file = "data/data.train.0.pt"
    valid_data_file = "data/data.valid.0.pt"
    dataset_fields = dict(chain.from_iterable(vocab_fields.values()))

    train_iter = onmt.inputters.inputter.DatasetLazyIter(
        dataset_paths=[train_data_file],
        fields=dataset_fields,
        batch_size=50,
        batch_size_multiple=1,
        batch_size_fn=None,
        device=dev,
        is_train=True,
        repeat=True)

    valid_iter = onmt.inputters.inputter.DatasetLazyIter(
        dataset_paths=[valid_data_file],
        fields=dataset_fields,
        batch_size=10,
        batch_size_multiple=1,
        batch_size_fn=None,
        device=dev,
        is_train=False,
        repeat=False)

    report_mgr = onmt.utils.ReportMgr(
        report_every=50, start_time=None, tensorboard_writer=None)
    trainer = onmt.Trainer(model=model,
                           train_loss=loss,
                           valid_loss=loss,
                           optim=optim,
                           report_manager=report_mgr)
    trainer.train(train_iter=train_iter,
                  train_steps=400,
                  valid_iter=valid_iter,
                  valid_steps=200)

    # Translate the validation set and print each source/prediction pair.
    src_reader = onmt.inputters.str2reader["text"]
    tgt_reader = onmt.inputters.str2reader["text"]
    scorer = onmt.translate.GNMTGlobalScorer(alpha=0.7,
                                             beta=0,
                                             length_penalty="avg",
                                             coverage_penalty="none")
    xdev = 0 if torch.cuda.is_available() else -1
    translator = onmt.translate.Translator(model=model,
                                           fields=vocab_fields,
                                           src_reader=src_reader,
                                           tgt_reader=tgt_reader,
                                           global_scorer=scorer,
                                           gpu=xdev)
    builder = onmt.translate.TranslationBuilder(
        data=torch.load(valid_data_file),
        fields=vocab_fields)

    for batch in valid_iter:
        trans_batch = translator.translate_batch(
            batch=batch, src_vocabs=[src_vocab],
            attn_debug=False)
        translations = builder.from_batch(trans_batch)
        for trans in translations:
            print(trans.log(0))


if __name__ == "__main__":
    main()
```
Thank you for the improvements; increasing the scale is indeed a good idea. However, I still reproduce the many-unks output with your code. Do you get better output?
@elisemicho Did you preprocess it with 10000 tokens instead of 1000? That's key here.
Yes. Just to make sure, I have tried it again; perplexity and prediction scores look better, but unks are still prevalent.
I mean there's stochasticity. Some runs will be better than others.
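If run-to-run variance is getting in the way of comparing results, one option (plain PyTorch, not something from this thread) is to fix the seed before building the model:

```python
import torch

torch.manual_seed(1234)  # makes weight init and dropout repeatable
# note: some CUDA kernels remain nondeterministic even with a fixed seed
```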
Not sure if you have some new changes to make after #1299.
@vince62s Indeed, I had some changes to make; it's done and ready to be merged.
Reverting to understand why the Travis build broke.
This example is great for digging into how things work in OpenNMT, but there have recently been quite a lot of changes in datasets, vocabs, losses, optimizers, iterators, trainer, translator, and logging, so the provided example was no longer working.
I made it work again for training, but I am missing something for translating, as only unks are output.
Can you help correct this last part and check that the modifications make correct use of the functions?
Thanks!