Bert-large model not attaining ~65% accuracy even after training till 52k timesteps! #10

Open

Description

We are training the BERT-large model on a P100 GPU with 25 GB of RAM.
When we ran the default code with bs=6 and num_batch_accumulated=4, we got a CUDA out-of-memory error.
So we changed it to bs=2 and num_batch_accumulated=8, since you said anything between 16...24 would perform similarly.
But after training for 52000 timesteps, the best accuracy we have seen is ~59.6%, at timestep 44000.
Is training taking longer because we changed the batch size, or is there something else we are missing?
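
For reference, this is how we compared the two settings. It is only a rough sketch: we are assuming that the effective batch size is bs * num_batch_accumulated, and that one timestep corresponds to one optimizer update after accumulation (we have not checked this in the training code). The helper name `effective_batch_size` is ours, not something from the repo.

```python
def effective_batch_size(bs: int, num_batch_accumulated: int) -> int:
    """Examples contributing to one optimizer update (assumed: bs * accumulation steps)."""
    return bs * num_batch_accumulated


default_setting = effective_batch_size(bs=6, num_batch_accumulated=4)
our_setting = effective_batch_size(bs=2, num_batch_accumulated=8)

steps = 52000  # how far we have trained so far

# Assumption: one timestep = one optimizer update, so examples seen = steps * effective batch size.
examples_default = steps * default_setting
examples_ours = steps * our_setting

print("default effective batch size:", default_setting)
print("our effective batch size:", our_setting)
print("examples seen at 52000 steps (default):", examples_default)
print("examples seen at 52000 steps (ours):", examples_ours)
```

If those assumptions hold, our run sits at the low end of the 16...24 range and has seen about two thirds as many examples as the default setting would have at the same timestep.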

Results at timesteps 48000 and 52000:

Loading model from logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00048000
DB connections: 100% 166/166 [02:31<00:00, 1.10it/s]
100% 1034/1034 [05:45<00:00, 2.99it/s]
DB connections: 100% 166/166 [00:00<00:00, 448.81it/s]
Wrote eval results to logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/ie_dirs/bert_run_true_1-step48000.eval
48000 0.5638297872340425

Loading model from logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00052000
DB connections: 100% 166/166 [00:00<00:00, 443.91it/s]
100% 1034/1034 [05:31<00:00, 3.12it/s]
DB connections: 100% 166/166 [00:00<00:00, 467.06it/s]
Wrote eval results to logdir/bert_run/bs=2,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/ie_dirs/bert_run_true_1-step52000.eval
52000 0.586073500967118
