
How to evaluate the performance number of Bert-Large training #83

Open
zhixingheyi-tian opened this issue Jun 15, 2021 · 2 comments

@zhixingheyi-tian

I ran into some confusion when following the guide at https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large to run the training workload.

Running command:

nohup python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 --mpi_num_processes=2 \
    --benchmark-only \
    --docker-image intel/intel-optimized-tensorflow:2.3.0 \
    --volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
    --volume $SQUAD_DIR:$SQUAD_DIR \
    --data-location=$BERT_LARGE_DIR \
    --num-intra-threads=26 \
    --num-inter-threads=1 \
    -- train-option=SQuAD \
    DEBIAN_FRONTEND=noninteractive \
    config_file=$BERT_LARGE_DIR/bert_config.json \
    init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
    vocab_file=$BERT_LARGE_DIR/vocab.txt \
    train_file=$SQUAD_DIR/train-v1.1.json \
    predict_file=$SQUAD_DIR/dev-v1.1.json \
    do-train=True \
    learning-rate=1.5e-5 \
    max-seq-length=384 \
    do_predict=True \
    warmup-steps=0 \
    num_train_epochs=2 \
    doc_stride=128 \
    do_lower_case=False \
    experimental-gelu=False \
    mpi_workers_sync_gradients=True >> training-0609 &

Result:

INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
I0610 01:09:58.730417 140427424720704 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
INFO:tensorflow:Processing example: 9000
I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000
INFO:tensorflow:Processing example: 10000
I0610 01:17:27.623694 140160153200448 run_squad.py:1363] Processing example: 10000
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625470 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625671 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
I0610 01:20:36.625791 140160153200448 run_squad.py:797] Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json
I0610 01:20:36.625833 140160153200448 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json

I didn't see the throughput information, i.e. (num_processed_examples - threshold_examples) / elapsed_time, in the training log the way the inference workload prints it. I also read the training script, models/models/language_modeling/tensorflow/bert_large/training/fp32/run_squad.py, and found no "throughput" calculation. The inference script, models/models/language_modeling/tensorflow/bert_large/inference/run_squad.py, does compute throughput as (num_processed_examples - threshold_examples) / elapsed_time.
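For reference, the throughput metric described above can be sketched as a small function. This is only an illustration of the formula; the variable names (`threshold_examples` as the warm-up examples excluded from the count) are assumptions, and the actual inference script may name things differently.

```python
def throughput(num_processed_examples: int,
               threshold_examples: int,
               elapsed_time: float) -> float:
    """Examples per second, excluding warm-up examples from the count.

    threshold_examples is assumed to be the number of initial (warm-up)
    examples whose processing time should not count toward the rate.
    """
    return (num_processed_examples - threshold_examples) / elapsed_time

# Example: 10000 examples processed, first 1000 treated as warm-up,
# measured over 600 seconds -> 15.0 examples/sec.
print(throughput(10000, 1000, 600.0))
```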

So how do I evaluate the performance number of BERT-Large training? There is neither "throughput" nor "elapsed time" in the log or the training script.

@ashahba @dmsuehir

Thanks

@dmsuehir
Contributor

The BERT large SQuAD training log will have values like INFO:tensorflow:examples/sec: .... This number can be multiplied by the number of MPI processes (in your example, that's 2, since you passed --mpi_num_processes=2) to get the total examples per second.
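The calculation above can be sketched as a small log-parsing helper. This is a hedged example, not part of the repo: it assumes the log contains lines of the form `INFO:tensorflow:examples/sec: <value>` as described in the comment, and it simply takes the last reported rate as the steady-state value.

```python
import re

def total_examples_per_sec(log_path: str, mpi_num_processes: int) -> float:
    """Scale the last per-process examples/sec in the log by the MPI count."""
    pattern = re.compile(r"examples/sec:\s*([\d.]+)")
    rates = []
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                rates.append(float(m.group(1)))
    if not rates:
        raise ValueError("no examples/sec lines found in log")
    # Take the last reported rate (steady state) and multiply by the
    # number of MPI processes to get aggregate throughput.
    return rates[-1] * mpi_num_processes
```

For example, a run launched with --mpi_num_processes=2 whose log ends with `INFO:tensorflow:examples/sec: 13.0` would report 26.0 total examples per second.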

ashahba pushed a commit that referenced this issue Apr 1, 2022
…Torch SPR) (#83)

* Add specs, docs, and quickstarts for BERT inference and training

* Add build and run scripts

* Update mount paths

* update base FROM

* Update spec to add quickstarts

* update wrapper to include run.sh

* Update path

* Update pip install -y

* Update bert installs

* Regenerate dockerfile

* Update dockerfile for bert train

* Update installs

* Doc updates

* Update dockerfile and run after testing training

* remove bert inf files from dockerfile

* Small doc updates

* Add shm-size 8G

* Fix error message

* Fix env var usages in build.sh

* Regenerate dockerfiles

* update conda activate partial

* Add build tools

* quickstart script updates

* Clarify dataset download instructions and switch CHECKPOINT_DIR to CONFIG_FILE

* Update quickstart and docs to have phase 2 use checkpoints from phase 1

* Fix script

@sramakintel
Contributor

@zhixingheyi-tian can you try our latest optimizations for TensorFlow BERT-Large by referring to the link here: https://www.intel.com/content/www/us/en/developer/articles/containers/cpu-reference-model-containers.html
