Add multithreading test and put a lock on nvinfer1::createInferRuntime() for TRT EP #10714

Merged: 9 commits merged into master from multi_threads_trt_test on Mar 16, 2022

Conversation

chilo-ms (Contributor) commented on Mar 1, 2022:

Added a multithreading test and put a lock around the nvinfer1::createInferRuntime() call.
Without the lock, the test can hit a segfault or fail due to race conditions.

chilo-ms requested review from stevenlix and jywu-msft on March 1, 2022 20:00
static OrtMutex singleton;

// Acquire a lock only when force_sequential_engine_build_ is true;
return force_sequential_engine_build_ ? std::unique_lock<OrtMutex>(singleton) : std::unique_lock<OrtMutex>();
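A default-constructed std::unique_lock owns no mutex, so the second branch above returns a lock that never locks anything. A minimal self-contained sketch of the pattern, with std::mutex standing in for OrtMutex and a bool parameter standing in for the force_sequential_engine_build_ member:

#include <mutex>

std::unique_lock<std::mutex> GetEngineBuildLock(bool serialize) {
  static std::mutex singleton;  // one mutex shared by all callers
  // Locks only when serialize is true; otherwise the returned unique_lock
  // holds no mutex, so no locking/unlocking ever happens.
  return serialize ? std::unique_lock<std::mutex>(singleton)
                   : std::unique_lock<std::mutex>();
}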
chilo-ms (Contributor, Author):

I was wondering whether we should use a real unique_lock all the time, since the engine build API is not thread safe.
(Even though the current multithreading test doesn't fail here, it's much safer to always hold the lock.)

jywu-msft (Member) commented on Mar 1, 2022:

Yes, I think we should always lock when creating an engine. At the very least, we don't want multiple threads building at the same time, as that can perturb the tactic timings, which can impact performance.
BTW, I noticed the buildEngineWithConfig() API is deprecated starting with TRT 8.0; it's replaced by buildSerializedNetwork().
At some point we need to make a pass through the TRT API calls and update any deprecated ones. +@stevenlix FYI
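For reference, a hedged sketch of the old and new build paths (error handling and object destruction are elided; the logger parameter is assumed to be an existing nvinfer1::ILogger, e.g. the one returned by GetTensorrtLogger() in this file):

#include "NvInfer.h"

nvinfer1::ICudaEngine* BuildEngineTrt8(nvinfer1::ILogger& logger) {
  nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
  auto flags = 1U << static_cast<uint32_t>(
      nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
  nvinfer1::INetworkDefinition* network = builder->createNetworkV2(flags);
  nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
  // ... populate the network (e.g. via the ONNX parser) and set config options ...

  // TRT < 8.0 (deprecated): builds the engine directly.
  // nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

  // TRT >= 8.0: build a serialized plan, then deserialize it with a runtime.
  nvinfer1::IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  return runtime->deserializeCudaEngine(plan->data(), plan->size());
}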

CreateBaseModel(model_name, graph_name, dims);

for (int i = 0; i < num_thread; ++i)
  threads.push_back(std::thread(RunInference, model_name, sess_log_id));
Member:

Each thread is creating AND running the inference.
We also need to test the case where you create a single session and have multiple threads run inference on the same session.
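A rough sketch of that case, using the public C++ API for brevity rather than the internal InferenceSession the test uses; the function name and the elided Run() call are illustrative:

#include <thread>
#include <vector>
#include <onnxruntime_cxx_api.h>

void RunSingleSessionMultiThreadInference(const char* model_path, int num_threads) {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt_mt_test");
  Ort::SessionOptions so;
  Ort::Session session(env, model_path, so);  // one session shared by all threads

  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&session]() {
      // Each thread calls Run() concurrently on the same session.
      // ... build inputs and call session.Run(...) here ...
    });
  }
  for (auto& t : threads) t.join();
}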

chilo-ms (Contributor, Author):

The single-session, multi-thread inference case has been added.

@@ -87,6 +88,97 @@ void CreateBaseModel(std::string model_name, std::string graph_name, std::vector
status = onnxruntime::Model::Save(model, model_name);
}

void RunInference(std::string model_name, std::string sess_log_id) {
Member:

I think we need to clarify with the naming that each thread is creating its own session and only one thread runs inference per session.
We need a multiple-run, single-session test as well.

chilo-ms (Contributor, Author):

Renamed the function.

Otherwise, the constructed unique_lock is not associated with any mutex therefore no locking/unlocking will happen.
Get a unique_lock object to control the concurrency behavior.
Every api call not in the thread-safe operations(https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#threading)
shoud be protected by a lock for multi-threading situation.
Member:

Typo: shoud.
Also, rewrite "multi-threading situation" as "when invoked by multiple threads concurrently".

chilo-ms (Contributor, Author):

Modified.

*/
std::unique_lock<OrtMutex> GetEngineBuildLock() const;
std::unique_lock<OrtMutex> GetLock() const;
Member:

GetLock is a bit too generic a name. Maybe GetApiLock() or something like that, for enforcing serial ordering of TensorRT operations?

chilo-ms (Contributor, Author):

Modified.

@@ -87,6 +88,190 @@ void CreateBaseModel(std::string model_name, std::string graph_name, std::vector
status = onnxruntime::Model::Save(model, model_name);
}

void RunSession(InferenceSession& session_object,
Member:

We might consider migrating these to inference_session_test in the future; these tests are basically generic and could be applied to all providers.
But probably don't do it in this PR, before testing whether all EPs can pass these tests.

@@ -250,6 +250,11 @@ TensorrtLogger& GetTensorrtLogger() {
return trt_logger;
}

std::unique_lock<OrtMutex> TensorrtExecutionProvider::GetApiLock() const {
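The body is cut off in this excerpt; given the snippet quoted at the top of the thread and the decision to always lock, it presumably reduces to something like this sketch:

std::unique_lock<OrtMutex> TensorrtExecutionProvider::GetApiLock() const {
  // Function-local static: a single mutex shared by every provider
  // instance (and therefore every session) in the process.
  static OrtMutex singleton;
  return std::unique_lock<OrtMutex>(singleton);
}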
Member:

We need to be careful if we expand usage of this beyond the current few locking calls, since it's a singleton and shared across all sessions.

Contributor:

If we keep the lock on all the time, we should remove the TRT option force_sequential_engine_build, which previously provided a way to turn the lock off.

Member:

Be careful when removing options from structs and APIs, though.

chilo-ms (Contributor, Author) commented on Mar 10, 2022:

Legacy OrtTensorRTProviderOptions struct: I don't think we can remove this field from the struct and build ORT directly. A user application built with the old struct may fail to access the force_sequential_engine_build field. Can we disable the field instead?
New OrtTensorRTProviderOptionsV2: I think we can remove this field since it's an opaque struct. We probably need to emit a warning when the user sets force_sequential_engine_build through CreateTensorRTProviderOptions().

Member:

The option has become a no-op, so it's not as critical, I think.
We should remove it from our documentation, though.

jywu-msft previously approved these changes on Mar 4, 2022

jywu-msft (Member):

@stevenlix do you have any comments for this PR?

chilo-ms (Contributor, Author):

I tested it by running a BERT model and didn't see a performance drop, so we can merge this PR into release 1.11.

@@ -1537,7 +1538,7 @@ common::Status TensorrtExecutionProvider::Compile(const std::vector<Node*>& fuse

// Build engine
{
auto lock = GetEngineBuildLock();
auto lock = GetApiLock();
chilo-ms (Contributor, Author):

I was wondering, do we need a lock here, since the whole inference is already protected by a lock?

std::lock_guard<OrtMutex> lock(*(trt_state->tensorrt_mu_ptr));

@@ -1537,7 +1538,6 @@ common::Status TensorrtExecutionProvider::Compile(const std::vector<Node*>& fuse

// Build engine
{
auto lock = GetEngineBuildLock();
Member:

Is the lock supposed to be removed here?

chilo-ms (Contributor, Author) commented on Mar 15, 2022:

Thanks for the reminder and explanation. Yes, we shouldn't remove this lock here.
The inference lock std::lock_guard lock(*(trt_state->tensorrt_mu_ptr)) at line 1290 is per provider instance, while the API lock here is a global singleton.
Consider a process where multiple threads each create a session and run inference: engine builds from different threads in different sessions could overlap. So we still need a global singleton lock to prevent overlapping engine builds.
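A self-contained sketch of how the two locks nest, with std::mutex standing in for OrtMutex and illustrative names:

#include <mutex>

std::mutex per_provider_mu;  // analogue of *(trt_state->tensorrt_mu_ptr)

std::unique_lock<std::mutex> GetApiLock() {
  static std::mutex api_mu;  // analogue of the global singleton, process-wide
  return std::unique_lock<std::mutex>(api_mu);
}

void ComputeFunc() {
  // Per-provider lock: serializes inference within ONE session/provider.
  std::lock_guard<std::mutex> inference_lock(per_provider_mu);
  // ... run inference; an engine rebuild may be triggered here ...
  {
    // Global lock: prevents engine builds in DIFFERENT sessions from
    // overlapping, which the per-provider lock alone cannot do.
    auto lock = GetApiLock();
    // ... build engine ...
  }
}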

chilo-ms merged commit 42d7112 into master on Mar 16, 2022
chilo-ms deleted the multi_threads_trt_test branch on March 16, 2022 16:19
chilo-ms added a commit that referenced this pull request Mar 16, 2022
…e() for TRT EP (#10714)

* Add multithread unit test and put lock on library call

* update code

* remove debug code

* add comment

* add one session multi-threads inference

* Put lock for build engine all the time

* Update naming and comment

* remove unnecessary lock

* Revert "remove unnecessary lock"

This reverts commit 9c2317b.
chilo-ms added a commit that referenced this pull request Mar 18, 2022
lavanyax pushed a commit to intel/onnxruntime that referenced this pull request Mar 29, 2022
…e() for TRT EP (microsoft#10714)
