
How to evaluate the performance number of Bert-Large training #83

Open
zhixingheyi-tian opened this issue Jun 15, 2021 · 2 comments

@zhixingheyi-tian

I ran into some confusion when following the guide at https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large to run the training workload.

Running command:

nohup python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 --mpi_num_processes=2 \
    --benchmark-only \
    --docker-image intel/intel-optimized-tensorflow:2.3.0 \
    --volume $BERT_LARGE_DIR:$BERT_LARGE_DIR \
    --volume $SQUAD_DIR:$SQUAD_DIR \
    --data-location=$BERT_LARGE_DIR \
    --num-intra-threads=26 \
    --num-inter-threads=1 \
    -- train-option=SQuAD \
    DEBIAN_FRONTEND=noninteractive \
    config_file=$BERT_LARGE_DIR/bert_config.json \
    init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
    vocab_file=$BERT_LARGE_DIR/vocab.txt \
    train_file=$SQUAD_DIR/train-v1.1.json \
    predict_file=$SQUAD_DIR/dev-v1.1.json \
    do-train=True \
    learning-rate=1.5e-5 \
    max-seq-length=384 \
    do_predict=True \
    warmup-steps=0 \
    num_train_epochs=2 \
    doc_stride=128 \
    do_lower_case=False \
    experimental-gelu=False \
    mpi_workers_sync_gradients=True >> training-0609 &

Result:

INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
I0610 01:09:58.730417 140427424720704 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/nbest_predictions.json
INFO:tensorflow:Processing example: 9000
I0610 01:13:27.192351 140160153200448 run_squad.py:1363] Processing example: 9000
INFO:tensorflow:Processing example: 10000
I0610 01:17:27.623694 140160153200448 run_squad.py:1363] Processing example: 10000
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625470 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0610 01:20:36.625671 140160153200448 error_handling.py:115] prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
I0610 01:20:36.625791 140160153200448 run_squad.py:797] Writing predictions to: /workspace/benchmarks/common/tensorflow/logs/1/predictions.json
INFO:tensorflow:Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json
I0610 01:20:36.625833 140160153200448 run_squad.py:798] Writing nbest to: /workspace/benchmarks/common/tensorflow/logs/1/nbest_predictions.json

I didn't see the throughput information, i.e. (num_processed_examples - threshold_examples) / elapsed_time, in the training log the way the inference workload prints it. I also read the training script, models/models/language_modeling/tensorflow/bert_large/training/fp32/run_squad.py, and found no "throughput" calculation. The inference script, models/models/language_modeling/tensorflow/bert_large/inference/run_squad.py, does compute throughput as (num_processed_examples - threshold_examples) / elapsed_time.
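For reference, the throughput metric described above can be sketched as a small function. This is only an illustration of the formula; the variable names (`threshold_examples` as the warm-up examples excluded from the count) are assumptions, and the actual inference script may name things differently.

```python
def throughput(num_processed_examples: int,
               threshold_examples: int,
               elapsed_time: float) -> float:
    """Examples per second, excluding warm-up examples from the count.

    threshold_examples is assumed to be the number of initial (warm-up)
    examples whose processing time should not count toward the rate.
    """
    return (num_processed_examples - threshold_examples) / elapsed_time

# Example: 10000 examples processed, first 1000 treated as warm-up,
# measured over 600 seconds -> 15.0 examples/sec.
print(throughput(10000, 1000, 600.0))
```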

So how do I evaluate the performance number of BERT-Large training? There is neither "throughput" nor "elapsed time" in the log or the training script.

@ashahba @dmsuehir

Thanks

@dmsuehir
Contributor

The BERT large SQuAD training log will have values like INFO:tensorflow:examples/sec: .... This number can be multiplied by the number of MPI processes (in your example, that's 2, since you passed --mpi_num_processes=2) to get the total examples per second.
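The calculation above can be sketched as a small log-parsing helper. This is a hedged example, not part of the repo: it assumes the log contains lines of the form `INFO:tensorflow:examples/sec: <value>` as described in the comment, and it simply takes the last reported rate as the steady-state value.

```python
import re

def total_examples_per_sec(log_path: str, mpi_num_processes: int) -> float:
    """Scale the last per-process examples/sec in the log by the MPI count."""
    pattern = re.compile(r"examples/sec:\s*([\d.]+)")
    rates = []
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                rates.append(float(m.group(1)))
    if not rates:
        raise ValueError("no examples/sec lines found in log")
    # Take the last reported rate (steady state) and multiply by the
    # number of MPI processes to get aggregate throughput.
    return rates[-1] * mpi_num_processes
```

For example, a run launched with --mpi_num_processes=2 whose log ends with `INFO:tensorflow:examples/sec: 13.0` would report 26.0 total examples per second.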

ashahba pushed a commit that referenced this issue Apr 1, 2022
…Torch SPR) (#83)

* Add specs, docs, and quickstarts for BERT inference and training

* Add build and run scripts

* Update mount paths

* update base FROM

* Update spec to add quickstarts

* update wrapper to include run.sh

* Update path

* Update pip install -y

* Update bert installs

* Regenerate dockerfile

* Update dockerfile for bert train

* Update installs

* Doc updates

* Update dockerfile and run after testing training

* remove bert inf files from dockerfile

* Small doc updates

* Add shm-size 8G

* Fix error message

* Fix env var usages in build.sh

* Regenerate dockerfiles

* update conda activate partial

* Add build tools

* quickstart script updates

* Clarify dataset download instructions and switch CHECKPOINT_DIR to CONFIG_FILE

* Update quickstart and docs to have phase 2 use checkpoints from phase 1

* Fix script

@sramakintel
Contributor

@zhixingheyi-tian can you try our latest optimizations for TensorFlow BERT-Large by referring to the link here: https://www.intel.com/content/www/us/en/developer/articles/containers/cpu-reference-model-containers.html
