Comparison of different methods for benchmarking #6218

Closed
@patrickvonplaten

Description

Currently, the benchmarking tools make use of multi-processing to be sure that all memory is released after each measurement, and use the py3nvml library to measure "peak GPU usage".

After some internal discussion, it is questionable whether the current code really measures peak GPU memory usage. I therefore ran a couple of experiments to see how torch-based benchmarking differs from py3nvml. It is known that the two approaches measure memory differently, as explained here: https://stackoverflow.com/questions/62257967/why-does-a-single-conv2d-with-10x10x3-take-up-850mb-of-gpu#_=_
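For context, this is roughly the shape of the current measurement: the workload runs in a child process (so that all GPU memory is released when the process exits) and the device's used memory is read through py3nvml. This is only a minimal sketch, not the actual benchmark code; the default model name, batch shape, and dummy inputs are placeholders.

```python
import multiprocessing

from py3nvml import py3nvml


def _run_and_measure(queue, model_name="gpt2", batch_size=8, seq_len=8):
    # Everything GPU-related happens inside this child process, so all
    # memory is released again when the process exits.
    import torch
    from transformers import AutoModel

    device = torch.device("cuda:0")
    model = AutoModel.from_pretrained(model_name).to(device)
    dummy_input = torch.randint(
        0, model.config.vocab_size, (batch_size, seq_len), device=device
    )

    with torch.no_grad():
        model(dummy_input)

    # Ask the driver (via NVML) how much device memory is in use after the
    # forward pass. This includes the CUDA context and everything the
    # caching allocator is holding on to.
    py3nvml.nvmlInit()
    handle = py3nvml.nvmlDeviceGetHandleByIndex(0)
    used_mb = py3nvml.nvmlDeviceGetMemoryInfo(handle).used // 1024 ** 2
    py3nvml.nvmlShutdown()
    queue.put(used_mb)


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=_run_and_measure, args=(queue,))
    process.start()
    print(f"py3nvml used memory: {queue.get()} MB")
    process.join()
```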

For a comparison, the following command was run:

```bash
python run_benchmark.py --models gpt2 bert-base-cased xlnet-base-cased --no_speed --save_to_csv --batch_sizes 8 64
```

The environment information is the following:

transformers_version 3.0.2
framework PyTorch
use_torchscript False
framework_version 1.6.0
python_version 3.6.10
system Linux
cpu x86_64
architecture 64bit
date 2020-08-03
time 14:47:20.956286
fp16 False
use_multiprocessing True
only_pretrain_model False
cpu_ram_mb 32088
use_gpu True
num_gpus 1
gpu TITAN RTX
gpu_ram_mb 24217
gpu_power_watts 280.0
gpu_performance_state 0
use_tpu False

a) These are the results when running the command with the current code (py3nvml):

model batch_size sequence_length result (MB)
gpt2 8 8 1422
gpt2 8 32 1454
gpt2 8 128 1732
gpt2 8 512 2784
gpt2 64 8 1558
gpt2 64 32 2086
gpt2 64 128 4170
gpt2 64 512 12482
bert-base-cased 8 8 1326
bert-base-cased 8 32 1360
bert-base-cased 8 128 1470
bert-base-cased 8 512 2042
bert-base-cased 64 8 1382
bert-base-cased 64 32 1640
bert-base-cased 64 128 2664
bert-base-cased 64 512 7158
xlnet-base-cased 8 8 1360
xlnet-base-cased 8 32 1422
xlnet-base-cased 8 128 1610
xlnet-base-cased 8 512 2476
xlnet-base-cased 64 8 1436
xlnet-base-cased 64 32 1830
xlnet-base-cased 64 128 3336
xlnet-base-cased 64 512 10344

b) These are the results when using the function torch.cuda.max_memory_reserved(torch.cuda.current_device()) instead (a minimal usage sketch follows the table):

model batch_size sequence_length result (MB)
gpt2 8 8 566
gpt2 8 32 598
gpt2 8 128 888
gpt2 8 512 1928
gpt2 64 8 702
gpt2 64 32 1230
gpt2 64 128 3314
gpt2 64 512 11626
bert-base-cased 8 8 470
bert-base-cased 8 32 504
bert-base-cased 8 128 614
bert-base-cased 8 512 1186
bert-base-cased 64 8 526
bert-base-cased 64 32 784
bert-base-cased 64 128 1808
bert-base-cased 64 512 6302
xlnet-base-cased 8 8 504
xlnet-base-cased 8 32 566
xlnet-base-cased 8 128 754
xlnet-base-cased 8 512 1620
xlnet-base-cased 64 8 580
xlnet-base-cased 64 32 974
xlnet-base-cased 64 128 2480
xlnet-base-cased 64 512 9488
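For reference, a minimal sketch of the torch-side measurement used for b); again not the benchmark script itself, and the checkpoint and dummy input shape are placeholders:

```python
import torch
from transformers import AutoModel

device = torch.device("cuda:0")

# In a fresh process the peak statistics start at zero, so this captures the
# peak memory the caching allocator held for the weights plus the forward pass.
model = AutoModel.from_pretrained("bert-base-cased").to(device)
dummy_input = torch.randint(0, model.config.vocab_size, (8, 8), device=device)

with torch.no_grad():
    model(dummy_input)

# The CUDA context and the kernels loaded onto the GPU are not counted here,
# which is presumably where the constant offset to the py3nvml numbers comes from.
peak_mb = torch.cuda.max_memory_reserved(torch.cuda.current_device()) // 1024 ** 2
print(f"torch.cuda.max_memory_reserved: {peak_mb} MB")
```

Note that max_memory_reserved also counts memory the caching allocator holds but has not handed out to tensors, so it is closer to what the process actually occupies on the device than max_memory_allocated.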

One can see that the difference is always 856 MB (apart from a single exception where it is 868 MB). I ran the py3nvml benchmark multiple times and the result is very stable. The same holds true when benchmarking training. The constant offset most likely corresponds to the CUDA context and the kernels PyTorch loads onto the GPU, which NVML sees but which the caching allocator does not count.
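If that is right, the offset should roughly equal the memory NVML reports right after the CUDA context is created. A quick sketch to check this (assuming an otherwise idle GPU 0):

```python
import torch
from py3nvml import py3nvml

py3nvml.nvmlInit()
handle = py3nvml.nvmlDeviceGetHandleByIndex(0)
used_before = py3nvml.nvmlDeviceGetMemoryInfo(handle).used

# Force CUDA context creation and kernel loading with a trivial op.
torch.ones(1, device="cuda:0")
torch.cuda.synchronize()

used_after = py3nvml.nvmlDeviceGetMemoryInfo(handle).used
py3nvml.nvmlShutdown()

# Should be in the same ballpark as the ~856 MB gap between a) and b),
# modulo kernels that some ops (e.g. cuDNN) only load lazily later on.
print(f"CUDA context overhead: {(used_after - used_before) // 1024 ** 2} MB")
```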

=> I tend to think that the code as it is currently implemented does give the peak memory usage, even though I could not find explicit confirmation in the https://github.com/fbcotter/py3nvml library.

@stas00 - what is your opinion on that?
