Description
Currently, the benchmarking tools use multiprocessing to make sure that all memory is released after each measurement, and rely on the py3nvml library to measure "peak GPU usage".
After some internal discussion, it is questionable whether the current code actually reports peak GPU memory usage. I therefore ran a couple of experiments to see how PyTorch's own memory measurement differs from py3nvml's. It is known that the two approaches measure memory differently, as explained here: https://stackoverflow.com/questions/62257967/why-does-a-single-conv2d-with-10x10x3-take-up-850mb-of-gpu#_=_
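For context, here is a minimal sketch of what a py3nvml-based measurement looks like; the helper name `nvml_used_memory_mb` and the toy `Linear` model are illustrative stand-ins, not the actual benchmark code. NVML reports device-wide usage, i.e. it includes the CUDA context:

```python
import torch
from py3nvml.py3nvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

def nvml_used_memory_mb(device_idx=0):
    """Memory currently used on the whole GPU, in MB (includes the CUDA context)."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(device_idx)
    mem_info = nvmlDeviceGetMemoryInfo(handle)
    nvmlShutdown()
    return mem_info.used / 1024 ** 2

# toy stand-in for the benchmarked model / forward pass
model = torch.nn.Linear(1024, 1024).cuda()
_ = model(torch.randn(8, 1024, device="cuda"))
torch.cuda.synchronize()
print(f"nvml used memory: {nvml_used_memory_mb():.0f} MB")
```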
For a comparison, the following command was run:

```bash
python run_benchmark.py --models gpt2 bert-base-cased xlnet-base-cased --no_speed --save_to_csv --batch_sizes 8 64
```
The environment information is the following:
transformers_version | 3.0.2 |
---|---|
framework | PyTorch |
use_torchscript | False |
framework_version | 1.6.0 |
python_version | 3.6.10 |
system | Linux |
cpu | x86_64 |
architecture | 64bit |
date | 2020-08-03 |
time | 14:47:20.956286 |
fp16 | False |
use_multiprocessing | True |
only_pretrain_model | False |
cpu_ram_mb | 32088 |
use_gpu | True |
num_gpus | 1 |
gpu | TITAN RTX |
gpu_ram_mb | 24217 |
gpu_power_watts | 280.0 |
gpu_performance_state | 0 |
use_tpu | False |
a) These are the results when running the command with the current code (py3nvml):
model | batch_size | sequence_length | result (MB) |
---|---|---|---|
gpt2 | 8 | 8 | 1422 |
gpt2 | 8 | 32 | 1454 |
gpt2 | 8 | 128 | 1732 |
gpt2 | 8 | 512 | 2784 |
gpt2 | 64 | 8 | 1558 |
gpt2 | 64 | 32 | 2086 |
gpt2 | 64 | 128 | 4170 |
gpt2 | 64 | 512 | 12482 |
bert-base-cased | 8 | 8 | 1326 |
bert-base-cased | 8 | 32 | 1360 |
bert-base-cased | 8 | 128 | 1470 |
bert-base-cased | 8 | 512 | 2042 |
bert-base-cased | 64 | 8 | 1382 |
bert-base-cased | 64 | 32 | 1640 |
bert-base-cased | 64 | 128 | 2664 |
bert-base-cased | 64 | 512 | 7158 |
xlnet-base-cased | 8 | 8 | 1360 |
xlnet-base-cased | 8 | 32 | 1422 |
xlnet-base-cased | 8 | 128 | 1610 |
xlnet-base-cased | 8 | 512 | 2476 |
xlnet-base-cased | 64 | 8 | 1436 |
xlnet-base-cased | 64 | 32 | 1830 |
xlnet-base-cased | 64 | 128 | 3336 |
xlnet-base-cased | 64 | 512 | 10344 |
b) These are the results when using the function `torch.cuda.max_memory_reserved(torch.cuda.current_device())` instead (a minimal sketch of this measurement follows the table):
model | batch_size | sequence_length | result (MB) |
---|---|---|---|
gpt2 | 8 | 8 | 566 |
gpt2 | 8 | 32 | 598 |
gpt2 | 8 | 128 | 888 |
gpt2 | 8 | 512 | 1928 |
gpt2 | 64 | 8 | 702 |
gpt2 | 64 | 32 | 1230 |
gpt2 | 64 | 128 | 3314 |
gpt2 | 64 | 512 | 11626 |
bert-base-cased | 8 | 8 | 470 |
bert-base-cased | 8 | 32 | 504 |
bert-base-cased | 8 | 128 | 614 |
bert-base-cased | 8 | 512 | 1186 |
bert-base-cased | 64 | 8 | 526 |
bert-base-cased | 64 | 32 | 784 |
bert-base-cased | 64 | 128 | 1808 |
bert-base-cased | 64 | 512 | 6302 |
xlnet-base-cased | 8 | 8 | 504 |
xlnet-base-cased | 8 | 32 | 566 |
xlnet-base-cased | 8 | 128 | 754 |
xlnet-base-cased | 8 | 512 | 1620 |
xlnet-base-cased | 64 | 8 | 580 |
xlnet-base-cased | 64 | 32 | 974 |
xlnet-base-cased | 64 | 128 | 2480 |
xlnet-base-cased | 64 | 512 | 9488 |
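For reference, here is a minimal sketch of how this alternative statistic is read; again the toy `Linear` model is only a stand-in for the benchmarked models. The PyTorch allocator statistics only track memory reserved by the caching allocator and therefore exclude the CUDA context:

```python
import torch

device = torch.cuda.current_device()
torch.cuda.reset_peak_memory_stats(device)

# toy stand-in for the benchmarked model / forward pass
model = torch.nn.Linear(1024, 1024).cuda()
_ = model(torch.randn(8, 1024, device="cuda"))
torch.cuda.synchronize()

peak_reserved_mb = torch.cuda.max_memory_reserved(device) / 1024 ** 2
print(f"peak memory reserved by the PyTorch allocator: {peak_reserved_mb:.0f} MB")
```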
One can see that the difference is always 856 MB (with a single exception of 868 MB), which presumably corresponds to the constant CUDA context overhead discussed in the Stack Overflow question linked above. I ran the py3nvml benchmark multiple times and the results are very stable.
The same holds true when benchmarking training.
=> I tend to think that, the way the code is currently implemented, it does give the peak GPU memory usage, even though I could not find proof of this in the https://github.com/fbcotter/py3nvml library.
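If that interpretation is right, the constant offset should simply be the CUDA context. That can be checked roughly with the sketch below (illustrative only; the printed values depend on the GPU, driver, and PyTorch build, and assume a single visible GPU so the NVML and CUDA device indices match):

```python
import torch
from py3nvml.py3nvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

# Create the CUDA context with a tiny allocation, before loading any model.
torch.ones(1, device="cuda")
torch.cuda.synchronize()

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(torch.cuda.current_device())
device_used_mb = nvmlDeviceGetMemoryInfo(handle).used / 1024 ** 2
nvmlShutdown()

allocator_reserved_mb = torch.cuda.max_memory_reserved() / 1024 ** 2
print(f"device-wide used (nvml): {device_used_mb:.0f} MB")
print(f"reserved by the PyTorch allocator: {allocator_reserved_mb:.0f} MB")
# The gap between the two numbers is the context overhead; if the hypothesis
# holds, it should be close to the ~856 MB offset seen in the tables above.
```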
@stas00 - what is your opinion on that?