This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Problems with transformer model training #1114

Open
@kudou1994

Description

Q1: When I train transformer_base or transformer_big on a dataset of size 100,000, 1,000,000, or 10,000,000, the loss stops decreasing once it reaches about 1.2 (loss = 1.2).
INFO:tensorflow:loss = 1.3547997, step = 1191000 (34.544 sec)
Q2: Why does evaluation finish when it is only 20 or 30 percent complete, instead of running over all the data?
INFO:tensorflow:Restoring parameters from train/model.ckpt-120000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Finished evaluation at 2018-10-08-01:04:22
Q3: When I use universal_transformer, the loss stops decreasing once it reaches about 4.3 (loss = 4.3).
INFO:tensorflow:global_step/sec: 3.36792
INFO:tensorflow:loss = 4.30654, step = 120600 (29.692 sec)

Environment information

OS: Ubuntu 16.04

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow==1.10.1
tensorflow-gpu==1.10.1


$ python -V
Python 3.6.4 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
...
# Error logs:
INFO:tensorflow:loss = 1.3547997, step = 1191000 (34.544 sec)
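
To make the plateau easier to check across a long run, here is a minimal sketch that pulls (step, loss) pairs out of log lines in the format quoted above. The function name `parse_loss_lines` and the regular expression are illustrative assumptions based only on the log lines in this issue; they are not part of tensor2tensor or TensorFlow.

```python
import re

# Matches lines such as:
#   INFO:tensorflow:loss = 1.3547997, step = 1191000 (34.544 sec)
# The pattern is an assumption derived from the log lines quoted in this issue.
LOSS_RE = re.compile(r"loss = ([0-9.eE+-]+), step = (\d+)")


def parse_loss_lines(lines):
    """Return a list of (step, loss) tuples found in the given log lines."""
    points = []
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            points.append((int(match.group(2)), float(match.group(1))))
    return points


if __name__ == "__main__":
    sample = [
        "INFO:tensorflow:global_step/sec: 3.36792",
        "INFO:tensorflow:loss = 4.30654, step = 120600 (29.692 sec)",
        "INFO:tensorflow:loss = 1.3547997, step = 1191000 (34.544 sec)",
    ]
    for step, loss in parse_loss_lines(sample):
        print(step, loss)
```

Running the parser over a full training log and comparing the loss at widely spaced steps would show whether the value is truly flat or still drifting down slowly.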
