GNMT on tensorflow/nmt vs. GNMT on google/seq2seq #131
Hi jongsae, I think gnmt_v2 is a bigger factor than SGD. normed_bahdanau is better than bahdanau attention; scaled_luong is good, but somehow I couldn't get it to work with gnmt_v2. For the optimizer, what I have observed personally is that Adam makes things easier to train, but if you can manage to train with SGD with a large learning rate, you will get a better result! In fact, in all my NMT papers in 2014-2016, I used a pretty universal set of hyperparameters: sgd, learning rate 1.0, uniform init 0.1, grad norm 5, dropout 0.2 :) Hope that helps!
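To make the recipe above concrete, here is a minimal, framework-agnostic NumPy sketch of plain SGD with the stated settings (learning rate 1.0, uniform init in [-0.1, 0.1], global gradient norm clipped to 5, dropout 0.2) on a toy regression problem. The model and data are placeholders, not the NMT architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model, initialized uniformly in [-0.1, 0.1] as in the recipe above.
W = rng.uniform(-0.1, 0.1, size=(20, 1))

learning_rate = 1.0     # plain SGD with a "large" learning rate
max_grad_norm = 5.0     # clip gradients to a global norm of 5
keep_prob = 0.8         # dropout 0.2

def clip_by_global_norm(grads, max_norm):
    """Jointly rescale gradients if their global L2 norm exceeds max_norm
    (same idea as tf.clip_by_global_norm)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

for step in range(100):
    x = rng.standard_normal((32, 20))
    y = x @ np.ones((20, 1))                    # toy regression targets
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    h = x * mask                                # inverted dropout on the inputs
    pred = h @ W
    grad_W = h.T @ (pred - y) / len(x)          # gradient of 0.5 * mean squared error
    grad_W, = clip_by_global_norm([grad_W], max_grad_norm)
    W -= learning_rate * grad_W                 # SGD update
```

The global-norm clipping is presumably what keeps the large learning rate from diverging on occasional large gradients, which is part of why lr 1.0 is workable with plain SGD.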
Hello @lmthang, I am developing an NMT system. Both of these implementations are a product of a Google team, so I was wondering why both of them exist.
Hi @ssokhey, google/seq2seq was developed to be general purpose, with usage of Estimator and various customizations / add-ons. tensorflow/nmt was initially developed from a teaching perspective, avoiding high-level APIs like Estimator that abstract away many details. Over the course of development, we also managed to replicate Google's NMT system with very good performance (outperforming google/seq2seq too, see https://github.com/tensorflow/nmt#wmt-english-german--full-comparison). I'd recommend using tensorflow/nmt as it is still being regularly maintained and can be used with newer versions of TF.
Thanks a lot! @lmthang
Hi |
Hey @frajos100, can you share the parameter settings you're using, and also what is the dataset size?
The parameter settings are the same as wmt16_gnmt_4_layer.json, present at nmt/standard_hparams/wmt16_gnmt_4_layer.json in https://github.com/tensorflow/nmt.
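For reference, one way to inspect such a settings file is simply to load the JSON; a minimal sketch, assuming the tensorflow/nmt repository is checked out locally (the key names printed below are illustrative guesses based on this discussion, not a verbatim listing of the file):

```python
import json

# Path as referenced above; assumes the repo is checked out locally.
with open("nmt/standard_hparams/wmt16_gnmt_4_layer.json") as f:
    hparams = json.load(f)

# Key names here are illustrative -- check the actual file for the exact set.
for key in ("attention", "attention_architecture", "optimizer",
            "learning_rate", "num_layers", "dropout"):
    print(key, "=", hparams.get(key))
```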
Hi ssokhey,
Hi, I am trying to reproduce on tensorflow/nmt the training results I originally generated using google/seq2seq.
I noticed that the standard hyperparams provided here lead to a much higher BLEU score (21.45 vs. 15.9) when trained for the same number of steps (80K). Is this because of algorithmic changes such as normed_bahdanau and gnmt_v2, or because of the optimized implementation of NMT? Or is there some other reason?
One more thing: this setup uses SGD instead of Adam, which was the default in google/seq2seq. Moreover, the learning rate used is surprisingly high (1.0). Maybe the optimizer change affected the training curve? I think Adam is more commonly used these days, so why was SGD selected in this case?
I would appreciate any explanations or comments that help me understand the algorithmic and implementation differences between these two training setups.
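For readers unfamiliar with the attention variants discussed in this thread, here is a minimal NumPy sketch of the score functions as described in the original papers (additive Bahdanau vs. its weight-normalized variant, and multiplicative Luong vs. its scaled variant). The parameters are random placeholders, and the exact implementation in tensorflow/nmt may differ in details.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # hidden size
query = rng.standard_normal(d)          # decoder state at the current step
keys = rng.standard_normal((5, d))      # encoder states for a 5-token source

# Random placeholder parameters; in the real models these are learned.
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
b = np.zeros(d)
g = 1.0                                 # learned scalar in the normalized / scaled variants
W = rng.standard_normal((d, d))

def bahdanau(query, keys):
    # Additive attention: v^T tanh(W1 q + W2 k)
    return np.tanh(query @ W1 + keys @ W2) @ v

def normed_bahdanau(query, keys):
    # Weight-normalized variant: v is rescaled to unit norm times a learned
    # scalar g, and a bias is added inside the tanh.
    v_hat = g * v / np.linalg.norm(v)
    return np.tanh(query @ W1 + keys @ W2 + b) @ v_hat

def luong(query, keys):
    # Multiplicative ("general") attention: q^T W k
    return keys @ (W @ query)

def scaled_luong(query, keys):
    # Same score multiplied by a single learned scalar g.
    return g * luong(query, keys)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for fn in (bahdanau, normed_bahdanau, luong, scaled_luong):
    print(fn.__name__, softmax(fn(query, keys)))
```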