Problems with transformer model training #1114
Description
Q1: Training problems with the Transformer model. When I train transformer_base or transformer_big on a dataset of 100,000, 1,000,000, or 10,000,000 examples, the loss stops decreasing once it reaches about 1.2 (loss = 1.2):
INFO:tensorflow:loss = 1.3547997, step = 1191000 (34.544 sec)
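For reference, a training invocation along these lines was presumably used. This is only a minimal sketch: the problem name, data paths, and step count are assumptions and were not given in the report.

```shell
# Hypothetical reproduction sketch; problem name, directories, and train_steps are assumed.
t2t-trainer \
  --data_dir=~/t2t_data \
  --output_dir=train \
  --problem=translate_enzh_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --train_steps=1200000 \
  --eval_steps=100
```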
Q2: Why does evaluation over all the data finish after only 20 or 30 percent of the evaluation steps?
INFO:tensorflow:Restoring parameters from train/model.ckpt-120000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Finished evaluation at 2018-10-08-01:04:22
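One possible explanation (an assumption, not confirmed in this thread): the Estimator stops evaluating either after --eval_steps batches or when the evaluation input is exhausted, whichever comes first, so a small dev set ends the run early at e.g. [30/100]. Sizing --eval_steps to the dev set makes the counter run to completion. A sketch with placeholder values:

```shell
# Same sketch as above, with eval_steps chosen to roughly match
# (number of dev examples / eval batch size); the value 30 is a placeholder.
t2t-trainer \
  --data_dir=~/t2t_data \
  --output_dir=train \
  --problem=translate_enzh_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --eval_steps=30
```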
Q3: When I use universal_transformer, the loss stops decreasing once it reaches about 4.3 (loss = 4.3):
INFO:tensorflow:global_step/sec: 3.36792
INFO:tensorflow:loss = 4.30654, step = 120600 (29.692 sec)
Environment information
OS: Ubuntu 16.04
$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow==1.10.1
tensorflow-gpu==1.10.1
$ python -V
Python 3.6.4 :: Anaconda, Inc.
For bugs: reproduction and error logs
# Steps to reproduce:
...
# Error logs:
INFO:tensorflow:loss = 1.3547997, step = 1191000 (34.544 sec)