Description
I ran into some confusion while following the guide at https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large to run the training workload.
Running command:
nohup python ./launch_benchmark.py \
--model-name=bert_large \
--precision=fp32 \
--mode=training \
--framework=tensorflow \
--batch-size=24 --mpi_num_processes=2 \
--benchmark-only \
--docker-image intel/intel-optimized-tensorflow:2.3.0 \
--volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
--volume $SQUAD_DIR:$SQUAD_DIR \
--data-location=$BERT_LARGE_DIR \
--num-intra-threads=26 \
--num-inter-threads=1 \
-- train-option=SQuAD DEBIAN_FRONTEND=noninteractive config_file=$BERT_LARGE_DIR/bert_config.json init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt vocab_file=$BERT_LARGE_DIR/vocab.txt train_file=$SQUAD_DIR/train-v1.1.json predict_file=$SQUAD_DIR/dev-v1.1.json do-train=True learning-rate=1.5e-5 max-seq-length=384 do_predict=True warmup-steps=0 num_train_epochs=2 doc_stride=128 do_lower_case=False experimental-gelu=False mpi_workers_sync_gradients=True >> training-0609 &
Result:
INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
I0610 01:09:58.730417 140427424720704 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
INFO:tensorflow:Processing example: 9000
I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000
INFO:tensorflow:Processing example: 10000
I0610 01:17:27.623694 140160153200448 run_squad.py:1363] Processing example: 10000
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625470 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625671 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
I0610 01:20:36.625791 140160153200448 run_squad.py:797] Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json
I0610 01:20:36.625833 140160153200448 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json
The training log does not contain the “throughput((num_processed_examples-threshod_examples)/Elapsedtime)” line that the inference workload prints. I also read the training script, models/models/language_modeling/tensorflow/bert_large/training/fp32/run_squad.py, and found nothing about “throughput” in it. The inference script, ./models/models/language_modeling/tensorflow/bert_large/inference/run_squad.py, does contain code for “throughput((num_processed_examples-threshod_examples)/Elapsedtime)”.
So how should I evaluate the performance of BERT-Large training? Neither "throughput" nor "Elapsedtime" appears in the log or in the training script.
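The only workaround I can think of is to derive a rough rate from the timestamps in the log prefix. Below is a minimal sketch of that idea, not something taken from the repository. The function names, the regular expression, and the `year` default are my own assumptions (the log prefix carries no year, so 2020 is only a placeholder). It matches only the "Processing example: N" lines shown above, which appear to come from the prediction phase, so it does not measure training itself. Because two MPI processes write to the same file, it groups lines by the numeric ID in the third field.

```python
import re
from datetime import datetime

# Matches lines such as:
# "I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000"
# Groups: month+day, time of day, numeric ID from the prefix, example count.
LINE_RE = re.compile(
    r"^I(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) (\d+) \S+\] Processing example: (\d+)"
)


def parse_progress(log_lines, year=2020):
    """Return {prefix_id: [(timestamp, example_count), ...]} for matching lines.

    `year` is a placeholder assumption; the log prefix does not record it.
    """
    progress = {}
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        month_day, clock, prefix_id, count = match.groups()
        timestamp = datetime.strptime(
            f"{year}{month_day} {clock}", "%Y%m%d %H:%M:%S.%f"
        )
        progress.setdefault(prefix_id, []).append((timestamp, int(count)))
    return progress


def examples_per_second(points):
    """Rate between the first and last progress point, or None if undefined."""
    if len(points) < 2:
        return None
    (start_time, start_count), (end_time, end_count) = points[0], points[-1]
    elapsed = (end_time - start_time).total_seconds()
    if elapsed <= 0:
        return None
    return (end_count - start_count) / elapsed


if __name__ == "__main__":
    sample = [
        "I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000",
        "I0610 01:17:27.623694 140160153200448 run_squad.py:1363] Processing example: 10000",
    ]
    for prefix_id, points in parse_progress(sample).items():
        print(prefix_id, examples_per_second(points))
```

For the two "Processing example" lines in my log this gives roughly 4.16 examples per second for that one process. It says nothing about the training steps, and that is the number I am actually looking for.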
Thanks