[Bug] [CUDA] TVM does not release all the GPU memory and does not turn off the maximum performance mode after the inference has been completed. #11307

KJlaccHoeUM9l (Contributor) opened this issue on May 13, 2022
Research that led to the problem

While running measurement experiments to evaluate inference performance on a GPU (NVIDIA Tesla T4), we noticed that the GPU temperature affects the final performance figures.

Below is a demonstration of this effect using GPT-2 as an example.

GPU Temperature (°C)    Performance (ms)
~35                     3.5713
~45                     3.6958
~60                     3.7856
~70                     3.8195
~80                     3.8904

* This table is not intended to demonstrate exact values; it only shows the general trend.

The table shows that as the GPU goes from a cold state before the run (~35°C) to a state of prolonged operation (~80°C), performance degrades by ~9%.
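
A measurement of this kind can be collected with TVM's GraphModule.benchmark helper; below is a minimal sketch on a toy model standing in for GPT-2 (the real experiment used the compiled GPT-2 module, which is too large to include here):

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Toy model: a single ReLU, standing in for the real GPT-2 module.
x = relay.var("x", shape=(1, 256), dtype="float32")
func = relay.Function([x], relay.nn.relu(x))
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(tvm.IRModule.from_expr(func), target="cuda")

dev = tvm.cuda(0)
gm = graph_executor.GraphModule(lib["default"](dev))
gm.set_input("x", np.random.uniform(size=(1, 256)).astype("float32"))

# benchmark() runs the module repeatedly and reports timing statistics in seconds.
res = gm.benchmark(dev, repeat=10, number=100)
print("mean latency: %.4f ms" % (res.mean * 1e3))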

Our main hypothesis was that this behavior is caused by the GPU clock frequency dropping as the device heats up (thermal throttling).

Using the official NVIDIA tool (nvidia-smi -q -d CLOCK), we can query the maximum GPU frequency:

Max Customer Boost Clocks
        Graphics                          : 1590 MHz

Using the same tool (nvidia-smi dmon), we can also trace the relationship between the GPU temperature and its clock frequency (the gtemp and pclk columns, i.e. the third and last ones):

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    69    63     -    89    56     0     0  5000  1545
    0    70    63     -    89    55     0     0  5000  1545
    0    69    64     -    89    55     0     0  5000  1530
    0    69    64     -    89    55     0     0  5000  1515

When the temperature rises from 35°C to 80°C, the GPU frequency drops from 1590 MHz to ~1430 MHz (a ~10% reduction).
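
The same temperature and SM-clock readings can also be pulled programmatically, which is convenient for logging them alongside the latency measurements; a minimal sketch using standard nvidia-smi query options (the helper name is ours):

# Query the current GPU temperature (°C) and SM clock (MHz) via nvidia-smi.
import subprocess

def gpu_temp_and_clock(device_id=0):
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=temperature.gpu,clocks.sm",
        "--format=csv,noheader,nounits",
        "-i", str(device_id),
    ]).decode()
    temp, clock = (int(v.strip()) for v in out.strip().split(","))
    return temp, clock

print(gpu_temp_and_clock(0))  # e.g. (63, 1545)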

Based on this, we conclude that the drop in GPT-2 performance is caused by the drop in GPU clock frequency as the device heats up.

Description of the problem

To reduce the effect of heat on performance, we need to let the GPU cool down whenever its frequency starts to drop due to heating. The simplest approach is to sleep after an inference completes once the temperature reaches a certain threshold, as sketched below.
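
A minimal sketch of such a cooldown step between inference runs, based on the same nvidia-smi temperature query as above (the threshold and polling interval are only illustrative; as explained next, this alone is not sufficient):

# Sleep between inference runs until the GPU has cooled below a target temperature.
import subprocess
import time

def gpu_temperature(device_id=0):
    out = subprocess.check_output([
        "nvidia-smi", "--query-gpu=temperature.gpu",
        "--format=csv,noheader,nounits", "-i", str(device_id),
    ])
    return int(out.decode().strip())

def cool_down(device_id=0, target_temp=45, poll_seconds=5):
    while gpu_temperature(device_id) > target_temp:
        time.sleep(poll_seconds)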

However, this solution does not work on its own. TVM does not release all of the GPU memory and does not turn off maximum-performance mode after inference has completed, so the GPU continues to heat up even when no inference is running. Resources are only released when the main Python process in which the measurement was run exits.

Expected behavior

After inference of the topology is complete and all objects that hold GPU resources have been destructed, TVM releases all GPU resources and the GPU returns to low-power mode.

Actual behavior

TVM does not release all of the GPU memory and does not turn off maximum-performance mode after inference has completed. Resources are only released when the main Python process in which the measurement was run exits.

Primary investigations

It should be said right away that this is not a memory leak in TVM; it is a consequence of how the CUDA Runtime API works.

Any CUDA Runtime API call on any thread that requires an active context will trigger the initialization of that device's primary context.

The primary context remains active until it is explicitly deinitialized with cudaDeviceReset(), which immediately deinitializes the primary context of the calling thread's current device.

Given this, to return the device to its original state, cudaDeviceReset() needs to be called after inference has finished.

However, it seems that this cannot be done automatically (inside TVM) because of the following problems:

  • When working in multi-threaded mode, calling this function resets the device for all threads, so all data on the GPU owned by other threads would be destroyed;

  • In single-threaded applications, resetting the GPU can destroy data that is still needed after inference;

  • There may also be problems with delayed destruction of an object whose destructor calls this function (for example, the garbage collector may destroy a GPU-using object only after a new GPU-using object has been created; resetting the device from the first object's destructor would then affect the second object).

Based on the problems described above, this call should not be placed inside TVM itself. However, it is possible to add an additional global function (via TVM_REGISTER_GLOBAL) to cuda_device_api.cc that resets the GPU. It would be called only when the user/programmer explicitly invokes it.

The implementation of the proposed idea might look like this:

TVM_REGISTER_GLOBAL("device_api.cuda_reset").set_body_typed([](int device_id) {
  CUDA_CALL(cudaSetDevice(device_id));
  CUDA_CALL(cudaDeviceReset());
});

And usage in Python code, for example:

reset_gpu = tvm.get_global_func("device_api.cuda_reset")
reset_gpu(0)
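
With such a function registered, the intended call order would be to destroy every TVM object that still holds GPU memory first and only then reset the device. A sketch, under the assumption that the proposed device_api.cuda_reset has been added (it does not exist in TVM today):

# Sketch: reset the GPU only after all TVM objects holding device memory are gone.
import tvm

def run_and_release(build_and_infer, device_id=0):
    # `build_and_infer` is any callable that creates its TVM/GPU objects locally,
    # so they are destructed when the callable returns.
    build_and_infer()

    # All GPU-holding objects are now out of scope; resetting the device
    # deinitializes the primary CUDA context and lets the GPU return to idle.
    reset_gpu = tvm.get_global_func("device_api.cuda_reset")
    reset_gpu(device_id)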

Environment

GPU:    Tesla T4 (16 GB)
CPU:    Intel(R) Xeon(R) CPU @ 2.00GHz
System: Ubuntu 20.04.3 LTS
Target: x86_64-linux-gnu
CUDA:   11.1
LLVM:   12

Steps to reproduce

Run the script below and look at the state of the GPU, for example, using nvidia-smi.

import time
import numpy as np
from onnx import helper, checker, mapping

import tvm
from tvm import relay
from tvm.contrib import graph_executor


def get_two_input_model(op_name):
    in_shape = [1, 2, 3, 3]
    in_type = mapping.NP_TYPE_TO_TENSOR_TYPE[np.dtype("float32")]
    out_shape = in_shape
    out_type = in_type

    layer = helper.make_node(op_name, ["in1", "in2"], ["out"])
    graph = helper.make_graph(
        [layer],
        "two_input_test",
        inputs=[
            helper.make_tensor_value_info("in1", in_type, in_shape),
            helper.make_tensor_value_info("in2", in_type, in_shape),
        ],
        outputs=[
            helper.make_tensor_value_info(
                "out", out_type, out_shape
            )
        ],
    )
    model = helper.make_model(graph, producer_name="two_input_test")
    checker.check_model(model, full_check=True)
    return model


def generate_input_dict():
    input_dict = {}
    input_info = [
        {'inputName': 'in1', 'inputDtype': 'float32', 'inputShape': [1, 2, 3, 3]},
        {'inputName': 'in2', 'inputDtype': 'float32', 'inputShape': [1, 2, 3, 3]},
    ]
    for i in input_info:
        input_name = i["inputName"]
        input_shape = i["inputShape"]
        input_dtype = i["inputDtype"]
        input_dict[input_name] = tvm.nd.array(np.random.uniform(size=input_shape).astype(input_dtype), tvm.cuda(0))

    return input_dict


def compile(model, target, target_host, opt_level, opset, freeze_params):
    irmod, params = relay.frontend.from_onnx(model, opset=opset, freeze_params=freeze_params)
    with tvm.transform.PassContext(opt_level=opt_level):
        lib = relay.build(irmod, target=target, target_host=target_host, params=params)

    # Loading the built module onto the GPU initializes the CUDA primary context
    # and allocates device memory for the embedded parameters.
    mod = graph_executor.GraphModule(lib["default"](tvm.cuda(0))).module

    # Setting the inputs allocates additional device memory that stays resident
    # until the Python process exits (this is the behavior described above).
    set_input = mod.get_function('set_input')
    for inp_name, inp in generate_input_dict().items():
        set_input(inp_name, inp)


def main():
    onnx_model = get_two_input_model("Add")

    compile_options = dict(
        target="cuda",
        target_host="llvm -mtriple=x86_64-linux-gnu",
        opt_level=3,
        opset_version=onnx_model.opset_import[0].version,
        freeze_weights=True,
    )

    compile(
        onnx_model,
        compile_options["target"],
        compile_options["target_host"],
        compile_options["opt_level"],
        compile_options["opset_version"],
        compile_options["freeze_weights"],
    )


if __name__ == "__main__":
    main()
    print("*** The inference is complete, however our Python process is still holding resources on the GPU. ***")
    time.sleep(60)
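
To see the effect, one can watch the GPU from a second terminal while the script above is sleeping; a small helper using nvidia-smi's compute-apps query (standard nvidia-smi options):

# List the processes currently holding GPU memory; the Python process running
# the script above still shows up here even though inference has finished.
import subprocess

print(subprocess.check_output([
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv",
]).decode())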